Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan · 2025 · cs.CV · arXiv 2506.17218

Canonical reference. 100% of citing Pith papers cite this work as background.

31 Pith papers citing it

Background 100% of classified citations

abstract

Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders their reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We first supervise the latent tokens through distillation from ground-truth image embeddings, and then switch to text-only supervision to make the latent trajectory align tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
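
The decoding idea in the abstract, where a hidden state is fed back as the next input instead of a sampled text token, can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_hidden_state` is a hypothetical stand-in for a VLM forward pass, the tiny `embed_table` is invented for illustration, and the latent steps are fixed in advance through `latent_positions`, whereas the abstract says the model itself chooses when to think visually.

```python
import math

def toy_hidden_state(sequence):
    """Hypothetical stand-in for a VLM forward pass: returns one hidden vector."""
    dim = len(sequence[0])
    mean = [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]
    return [math.tanh(x) for x in mean]

def decode_interleaved(prompt, embed_table, latent_positions, num_steps):
    """Greedy decoding in which some steps append a latent vector, not a text token."""
    sequence = [list(vec) for vec in prompt]
    trace = []
    for step in range(num_steps):
        hidden = toy_hidden_state(sequence)
        if step in latent_positions:
            # Latent step (assumed schedule): reuse the hidden state as the next input.
            sequence.append(hidden)
            trace.append("<latent>")
        else:
            # Text step: pick the token whose embedding best matches the hidden state.
            scores = [sum(h * e for h, e in zip(hidden, emb)) for emb in embed_table]
            token_id = max(range(len(scores)), key=scores.__getitem__)
            sequence.append(list(embed_table[token_id]))
            trace.append(token_id)
    return trace, sequence

# Illustrative three-token vocabulary with one-hot embeddings (an assumption).
embed_table = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
prompt = [[0.2, 0.9, 0.1]]
trace, sequence = decode_interleaved(
    prompt, embed_table, latent_positions={1, 2}, num_steps=4
)
print(trace)
```

In the returned `trace`, text steps appear as token ids and latent steps as the `<latent>` marker, while `sequence` holds the continuous vectors that were actually fed back. No image is decoded at any point, which is the property the abstract emphasizes.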

citation-role summary

background 10

citation-polarity summary

background 10

representative citing papers

Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

DLR creates discrete latent tokens from rendered CoT images via clustering, enabling up to 20x compression and interpretable trajectories that outperform continuous latent baselines on reasoning tasks.

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Future-L1 interleaves latent visual spans with text in MLLM decoding, trained on a custom Future-L1-50K dataset via LA-DAPO RL, and reports SOTA gains on FutureBench (61.0 to 85.4) and TwiFF-Bench (2.44 to 3.04).

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

STORM teaches LVLMs to internalize spatial-temporal reasoning via bounded latent trajectories trained with generated thought videos in two stages, improving accuracy on VideoMME, MVBench and similar benchmarks while lowering inference overhead.

Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

cs.CV · 2025-12-14 · unverdicted · novelty 7.0

DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.

CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization

cs.CV · 2026-05-09 · conditional · novelty 7.0 · 2 refs

CoLVR uses latent contrastive objectives with angle-based perturbation and RL trajectory rewards to increase exploratory visual reasoning in MLLMs, delivering 5-8% gains on VSP, Jigsaw, and MMStar benchmarks.

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.

V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

cs.CV · 2026-03-31 · unverdicted · novelty 7.0

V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.

Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

cs.CL · 2026-01-11 · unverdicted · novelty 7.0

Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet while using over 97% fewer inference tokens on six benchmarks.

Multimodal Continuous Reasoning via Asymmetric Mutual Variational Learning

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

AMVL applies bidirectional KL calibration to align the answer-agnostic prior with the answer-conditioned posterior in variational multimodal reasoning, reducing leakage and yielding a +10.83 average gain on the BLINK benchmark.

Latent Noise Mask for Reducing Visual Redundancy in Multimodal Large Language Models

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

Lens purifies visual evidence in MLLMs via question-conditioned latent noise masking with a LET token, yielding 2.4-6.4 point gains on VQA and grounding tasks.

CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

CVSBench benchmark shows VLMs struggle with cross-view spatial consistency but improve substantially when given 3D scene imagination inputs.

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

IPT supervision improves spatial reasoning in VLMs on perspective taking, path tracing, and multiview counting tasks, often outperforming textual chain-of-thought while remaining consistent with observed inputs.

MUSE: A Unified Agentic Harness for MLLMs

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

MUSE is a unified agentic harness that improves off-the-shelf MLLMs on visual spatial planning, perception, multimodal reasoning, and fine-grained discrimination benchmarks through structured execution modules and verifier-guided repair without model retraining.

ReGuLaR: Relation-Grounded Latent Reasoning for Large Vision-Language Models

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

ReGuLaR is a relation-grounded latent reasoning method for LVLMs that trains with a ReGFormer on a new RGROUNDING-351K dataset to achieve SOTA benchmark performance while running without the module at inference.

Artificial Phantasia: Emergent Mental Imagery in Large Language Models

cs.AI · 2025-09-27 · unverdicted · novelty 6.0

LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.

Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

cs.CV · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoning benchmarks.

Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, and BLINK benchmarks.

Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

cs.RO · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

Incremental visual scaffolding using multimodal models improves persistent common ground representation in situated dialogue by reducing representational blur compared to text-only approaches, with hybrid text-visual yielding best results on the IndiRef benchmark.

Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.

Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

cs.CV · 2026-04-12 · unverdicted · novelty 6.0 · 3 refs

A visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmark results with faster inference than explicit chain-of-thought methods.
