pith. machine review for the scientific record. sign in

arxiv: 2605.11856 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.CL

Recognition: no theorem link

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

Authors on Pith no claims yet

Pith reviewed 2026-05-13 07:28 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords visual latent reasoningmultimodal LLMschain-of-thoughtunified vision-languagetoken efficiencyvisual thinkingrendering to latents
0
0 comments X

The pith

UniVLR unifies text reasoning traces with images into compact visual latent tokens that multimodal models use directly for answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes UniVLR to fix the inefficiency of interleaving explicit text chain-of-thought with visual tokens in multimodal large language models. It renders textual reasoning together with auxiliary images into one shared visual workspace and compresses the result into visual latent tokens. At inference the model thinks only in these latents and decodes the final answer without separate text output or tool calls. Experiments on perception and visual reasoning tasks show higher accuracy with substantially fewer generated tokens than prior visual latent methods.

Core claim

UniVLR treats textual reasoning and auxiliary visual evidence as a shared visual workspace by rendering reasoning traces together with auxiliary images and learning to compress this unified representation into compact visual latent tokens. At inference time the model reasons only through the visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning.

What carries the argument

The UniVLR rendering-and-compression step that merges textual traces with auxiliary images into a single visual workspace before encoding to latent tokens.

If this is right

  • Inference runs with far fewer generated tokens because no separate text chain-of-thought path is produced.
  • The model can answer directly from visual latents, removing the need for external tool calls at test time.
  • Performance improves on real-world perception and visual reasoning benchmarks relative to prior interleaved visual latent methods.
  • Reasoning becomes a single-channel visual process rather than a fragmented text-plus-vision process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same compression idea could be tested on non-image modalities by first converting them into visual form.
  • If the latent tokens can be decoded back to readable traces, the approach might offer a new route to inspect reasoning steps.
  • Scaling the shared visual workspace to longer reasoning sequences would test whether token savings remain linear.

Load-bearing premise

Compressing the rendered textual reasoning traces together with auxiliary images into visual latent tokens preserves all necessary reasoning information without loss compared to explicit text chain-of-thought.

What would settle it

A direct side-by-side test on the same visual reasoning problems where the model using only the compressed visual latents produces wrong answers while the version with explicit text chain-of-thought produces correct answers.

Figures

Figures reproduced from arXiv: 2605.11856 by Chen Gao, Houcheng Jiang, Jiajun Fu, Junfeng Fang, Xiangnan He, Xiang Wang, Yong Li.

Figure 1
Figure 1. Figure 1: Comparison between the current methods and UniVLR. (a) Paradigm illustration com￾paring the interleaved design of existing visual latent reasoning methods with UniVLR. (b) Attention visualization showing that UniVLR induces denser and more focused attention to auxiliary images than the interleaved design. (c) Accuracy–efficiency comparison on visual reasoning benchmarks, where UniVLR achieves higher averag… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed UniVLR. Left: Unified Visual Canvas Rendering. Textual reasoning traces are rendered and composed with auxiliary visual evidence into a shared canvas, enabling both reasoning semantics and visual evidence to be encoded by the same vision encoder. Right: Unified Visual Latent Alignment. The canvas embeddings are compressed into fixed-length latent targets, and the MLLM learns to aut… view at source ↗
Figure 3
Figure 3. Figure 3: Reasoning-token efficiency and latent representation distribution. (a) UniVLR performs reasoning with a compact latent-token budget and no generated text CoT. (b) Last-hidden-state visualization shows that UniVLR latent tokens are closer to image-token representations than to text-token representations [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Causal perturbation analysis of latent reasoning tokens. We perturb latent hidden states during inference by zeroing them, injecting Gaussian noise, or repeating the first latent state. The resulting accuracy drops indicate that latent tokens carry task-relevant reasoning information. • Obs 4: UniVLR latent tokens align more closely with visual representations than textual ones. In [PITH_FULL_IMAGE:figure… view at source ↗
Figure 5
Figure 5. Figure 5: Effect of the number of unified visual latent tokens during inference on test accuracy. Both UniVLR-Stage1 and UniVLR are trained with a fixed teacher-forcing latent size of K = 24. The dashed line marks the zero-shot accuracy of Qwen2.5-VL-7B [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example of the Vertical Layout Strategy. Generated by Algorithm 1, this layout dynamically extends the canvas height to accommodate a long sequence of reasoning steps [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Example of the Compact Left-Right Layout Strategy. Generated by Algorithm 2, this layout partitions the canvas into two columns. The left panel houses the highlighted auxiliary image (serving as the visual grounding anchor), while the right panel vertically stacks the text-only reasoning cards. Directional flow arrows bridge the two columns, visually encoding the multimodal dependencies (e.g., text → joint… view at source ↗
Figure 8
Figure 8. Figure 8: Example of the Fixed-Canvas Adaptive Wrap Strategy. Generated by Algorithm 3, the canvas dimensions are strictly bounded 1024 × 1024. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
read the original abstract

Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces UniVLR, a unified visual latent reasoning framework for multimodal LLMs. It renders textual chain-of-thought traces together with auxiliary images into a shared visual workspace, compresses the result into compact visual latent tokens via a learned encoder, and performs all inference solely through these latents before directly decoding the final answer. This avoids interleaved text reasoning and external tools at inference time. Experiments on perception and visual reasoning benchmarks reportedly show higher accuracy than prior visual latent methods while generating substantially fewer reasoning tokens.

Significance. If the central claims hold under rigorous validation, UniVLR would offer a meaningful step toward more efficient and coherent visual thinking in MLLMs by collapsing fragmented text-vision channels into a single latent workspace. The reported token reductions could improve scalability for deployment, and the avoidance of explicit CoT at inference addresses a practical bottleneck in current interleaved approaches. However, the significance is tempered by the absence of detailed derivations or ablations confirming information preservation through lossy visual compression.

major comments (3)
  1. [§3] §3 (Method): The core assumption that rendering text CoT traces with auxiliary images and compressing them into visual latents preserves all necessary reasoning information is load-bearing for both accuracy and efficiency claims, yet the manuscript provides no ablation comparing retained symbolic precision against explicit text retention or against non-compressed baselines; visual encoders are lossy by design, raising the risk that fine-grained logical steps are discarded and cannot be recovered at inference.
  2. [§4] §4 (Experiments): Reported gains in accuracy and token reduction over prior interleaved visual latent methods lack error bars, multiple random seeds, or statistical significance tests; without these, it is unclear whether the improvements are robust or could be explained by differences in training data, model scale, or prompt engineering rather than the unified workspace itself.
  3. [§3.2] §3.2 (Compression module): The description of the shared visual workspace and latent compression does not include an analysis of information-theoretic bounds or reconstruction fidelity metrics (e.g., how much of the original text CoT can be recovered from the latents); this omission leaves the efficiency-accuracy tradeoff unquantified.
minor comments (3)
  1. [§3.1] Notation for the visual latent tokens is introduced without a clear equation or diagram showing the exact compression pipeline from rendered image to latent vector.
  2. [Figure 2] Figure 2 (architecture overview) would benefit from explicit callouts distinguishing the training-time rendering step from the inference-time latent-only path.
  3. The abstract claims 'substantially fewer generated reasoning tokens' but the main text does not define the exact token-counting protocol (e.g., whether it includes the initial prompt or only the generated latents).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our design choices and empirical support, while outlining targeted revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The core assumption that rendering text CoT traces with auxiliary images and compressing them into visual latents preserves all necessary reasoning information is load-bearing for both accuracy and efficiency claims, yet the manuscript provides no ablation comparing retained symbolic precision against explicit text retention or against non-compressed baselines; visual encoders are lossy by design, raising the risk that fine-grained logical steps are discarded and cannot be recovered at inference.

    Authors: We acknowledge that visual encoders introduce loss by design. However, UniVLR optimizes the compression end-to-end specifically for reasoning utility rather than pixel-level fidelity, and the consistent accuracy gains over interleaved baselines on perception and visual reasoning tasks indicate that task-critical information is retained. To directly address the request for ablations, we will add a new experiment in the revised §3 comparing against a non-compressed baseline that retains explicit text CoT alongside visual tokens. revision: yes

  2. Referee: [§4] §4 (Experiments): Reported gains in accuracy and token reduction over prior interleaved visual latent methods lack error bars, multiple random seeds, or statistical significance tests; without these, it is unclear whether the improvements are robust or could be explained by differences in training data, model scale, or prompt engineering rather than the unified workspace itself.

    Authors: We appreciate the emphasis on statistical rigor. The original results were reported from single runs due to compute constraints, but we have now run additional trials with varied random seeds. In the revised manuscript we will report means with standard deviations, error bars, and p-values from significance tests to confirm that the observed improvements are attributable to the unified workspace. revision: yes

  3. Referee: [§3.2] §3.2 (Compression module): The description of the shared visual workspace and latent compression does not include an analysis of information-theoretic bounds or reconstruction fidelity metrics (e.g., how much of the original text CoT can be recovered from the latents); this omission leaves the efficiency-accuracy tradeoff unquantified.

    Authors: We agree that explicit information-theoretic bounds would be informative. Because the latents are trained for reasoning rather than invertible reconstruction, direct text recovery metrics are not the primary objective. We will expand §3.2 with a quantitative discussion of the observed efficiency-accuracy tradeoff (token reduction versus accuracy) and include any feasible reconstruction experiments in the revision; a full theoretical bound analysis is left for future work as it is outside the current scope. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on empirical comparisons without derivations or self-referential reductions

full rationale

The paper introduces UniVLR as a framework that renders textual reasoning traces with auxiliary images and compresses them into visual latent tokens for inference-time reasoning without explicit text CoT. No equations, derivations, fitted parameters, or uniqueness theorems appear in the provided text. Central claims of improved accuracy and token efficiency are justified solely by experimental results on perception and visual reasoning tasks, which are independent of any self-citation chain or input-by-construction reduction. The approach is self-contained as an architectural proposal validated externally, yielding no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view reveals no explicit free parameters, axioms, or invented entities; the framework is presented at a conceptual level without mathematical formulation or fitting details.

pith-pipeline@v0.9.0 · 5479 in / 972 out tokens · 32567 ms · 2026-05-13T07:28:18.054575+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 12 internal anchors

  1. [1]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xiang, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, and Yi R. (May) Fung. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.CoRR, abs/2506.23918, 2025

  2. [2]

    V*: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. InCVPR, pages 13084–13094. IEEE, 2024

  3. [3]

    Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

    Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. InAAAI, pages 7907–7915. AAAI Press, 2025

  4. [4]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InICLR. OpenReview.net, 2024

  5. [5]

    Mme-realworld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? InICLR

    Yifan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, and Rong Jin. Mme-realworld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? InICLR. OpenReview.net, 2025

  6. [6]

    Smith, and Ranjay Krishna

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. InNeurIPS, 2024

  7. [7]

    Vtool-r1: Vlms learn to think with images via reinforcement learning on multimodal tool use.arXiv preprint arXiv:2505.19255, 2025

    Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, and Klara Nahrstedt. Vtool-r1: Vlms learn to think with images via reinforcement learning on multimodal tool use.CoRR, abs/2505.19255, 2025

  8. [8]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning.CoRR, abs/2505.14362, 2025

  9. [9]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning.CoRR, abs/2505.15966, 2025

  10. [10]

    Machine mental imagery: Empower multimodal reasoning with latent visual tokens.arXiv preprint arXiv:2506.17218, 2025

    Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multimodal reasoning with latent visual tokens.CoRR, abs/2506.17218, 2025

  11. [11]

    Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025a

    Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning.CoRR, abs/2509.24251, 2025

  12. [12]

    Monet: Reasoning in latent visual space beyond images and language.arXiv preprint arXiv:2511.21395, 2025

    Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, and Yisen Wang. Monet: Reasoning in latent visual space beyond images and language.CoRR, abs/2511.21395, 2025

  13. [13]

    Sketch-in-latents: Eliciting unified reasoning in mllms.CoRR, abs/2512.16584, 2025

    Jintao Tong, Jiaqi Gu, Yujing Lou, Lubin Fan, Yixiong Zou, Yue Wu, Jieping Ye, and Ruixuan Li. Sketch-in-latents: Eliciting unified reasoning in mllms.CoRR, abs/2512.16584, 2025

  14. [14]

    Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

    Chengzhi Liu, Yuzhe Yang, Yue Fan, Qingyue Wei, Sheng Liu, and Xin Eric Wang. Reasoning within the mind: Dynamic multimodal interleaving in latent space.CoRR, abs/2512.12623, 2025

  15. [15]

    Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens.arXiv preprint arXiv:2511.19418, 2025

    Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu, Trevor Darrell, and Xudong Wang. Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens.CoRR, abs/2511.19418, 2025. 10

  16. [16]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuan- dong Tian. Training large language models to reason in a continuous latent space.CoRR, abs/2412.06769, 2024

  17. [17]

    Thought-basedAttentionMask Construction

    Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space.CoRR, abs/2505.15778, 2025

  18. [18]

    The latent space: Foundation, evolution, mechanism, ability, and outlook.arXiv preprint arXiv:2604.02029, 2026

    Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, et al. The latent space: Foundation, evolution, mechanism, ability, and outlook.arXiv preprint arXiv:2604.02029, 2026

  19. [19]

    Visualizing thought

    Barbara Tversky. Visualizing thought. InHandbook of human centric visualization, pages 3–40. Springer, 2013

  20. [20]

    MIT press, 1995

    Vinod Goel.Sketches of thought. MIT press, 1995

  21. [21]

    DeepSeek-OCR: Contexts Optical Compression

    Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression.CoRR, abs/2510.18234, 2025

  22. [22]

    Deepseek-ocr 2: Visual causal flow.arXiv preprint arXiv:2601.20552,

    Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr 2: Visual causal flow.CoRR, abs/2601.20552, 2026

  23. [23]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Ming-Hsuan Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report...

  24. [24]

    Qwen3-VL Technical Report

    Qwen Team. Qwen3-vl technical report.CoRR, abs/2511.21631, 2025

  25. [25]

    Onelatent: Single-token compression for visual latent reasoning.CoRR, abs/2602.13738, 2026

    Bo Lv, Yasheng Sun, Junjie Wang, and Haoxiang Shi. Onelatent: Single-token compression for visual latent reasoning.CoRR, abs/2602.13738, 2026

  26. [26]

    Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

    Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, and Zheng Wei. Render-of- thought: Rendering textual chain-of-thought as images for visual latent reasoning.CoRR, abs/2601.14750, 2026

  27. [27]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  28. [28]

    Token fusion: Bridging the gap between token pruning and token merging

    Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, and Hongxia Jin. Token fusion: Bridging the gap between token pruning and token merging. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1383–1392, 2024

  29. [29]

    Layer by layer: Uncovering hidden representations in language models.arXiv preprint arXiv:2502.02013, 2025

    Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models.arXiv preprint arXiv:2502.02013, 2025

  30. [30]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. 2025

  31. [31]

    Tamp: Token-adaptive layerwise pruning in multimodal large language models

    Jaewoo Lee, Keyang Xuan, Chanakya Ekbote, Sandeep Polisetty, Yi R Fung, and Paul Pu Liang. Tamp: Token-adaptive layerwise pruning in multimodal large language models. InFindings of the Association for Computational Linguistics: ACL 2025, pages 6892–6908, 2025

  32. [32]

    Your large vision-language model only needs a few attention heads for visual grounding

    Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. Your large vision-language model only needs a few attention heads for visual grounding. pages 9339–9350, 2025

  33. [33]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM international conference on multimedia, 2024. 11

  34. [34]

    Zebra-cot: A dataset for interleaved vision language reasoning

    Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, et al. Zebra-cot: A dataset for interleaved vision language reasoning. arXiv preprint arXiv:2507.16746, 2025

  35. [35]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

  36. [36]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

  37. [37]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models.arXiv preprint arXiv:2305.10355, 2023

  38. [38]

    We-math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284, 2024

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284, 2024

  39. [39]

    visual thoughts,

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks.CoRR, abs/1506.03099, 2015. 12 A Broader Impacts UniVLR explores a more efficient way for MLLMs to conduct visual reasoning by replacing verbose explicit reasoning traces with compact visual latent tokens. This may reduce...

  40. [40]

    As outlined in Algorithm 1, the canvas width is constrained by a minimum threshold and the scaled auxiliary image

    Vertical Layout Strategy.This strategy is utilized to render lengthy reasoning traces where components naturally follow a sequential logic. As outlined in Algorithm 1, the canvas width is constrained by a minimum threshold and the scaled auxiliary image. Each textual or visual reasoning block is rendered into a rounded card, and the final canvas height is...

  41. [41]

    As detailed in Algorithm 2, the canvas has a fixed width but dynamic height

    Compact Left-Right Layout Strategy.For reasoning patterns involving a central visual grounding step surrounded by textual analysis (e.g., text → joint-image → text), we employ a compact left-right architecture. As detailed in Algorithm 2, the canvas has a fixed width but dynamic height. The left panel is dedicated to the highlighted auxiliary image and it...

  42. [42]

    As described in Algorithm 3, the canvas dimensions are rigidly constrained (e.g., 1024×1024 )

    Fixed-Canvas Adaptive Wrap Strategy.To ensure strict resolution control during certain experimental settings, we design a fixed-canvas strategy. As described in Algorithm 3, the canvas dimensions are rigidly constrained (e.g., 1024×1024 ). The auxiliary image is scaled and anchored to the bottom-left corner. The textual reasoning is then rendered using an...