When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Huaxiu Yao; Jaehong Yoon; Mingyu Ding; Mohit Bansal; Shoubin Yu; Yue Zhang; Zun Wang

arxiv: 2602.08236 · v2 · pith:3MDJXGPHnew · submitted 2026-02-09 · 💻 cs.CV · cs.AI · cs.CL

Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal

classification 💻 cs.CV cs.AI cs.CL

keywords imagination reasoning visual when spatial test-time world evidence

Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We first study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we then introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Finally, to further learn this gating and planning behavior without any annotation of when and how much to imagine, we introduce AVIC-R, which trains the policy via GRPO from QA-correctness rewards and penalties by imagination cost. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Our AVIC-R surpasses strong proprietary baselines including GPT-4o and GPT-4.1 while invoking the world model less often. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.
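
The abstract states that AVIC-R trains its policy via GRPO from QA-correctness rewards with penalties for imagination cost, but it gives no formula. The following is a minimal sketch of how such a signal could be computed, not the authors' implementation. The linear penalty, the `cost_weight` value, and the function names `reward` and `group_relative_advantages` are all hypothetical. The advantage step follows the standard GRPO normalization of rewards within a group of rollouts sampled for the same question.

```python
from statistics import mean, pstdev


def reward(correct: bool, num_world_model_calls: int, cost_weight: float = 0.1) -> float:
    """Hypothetical reward: 1 for a correct answer, minus a linear penalty
    per world-model call. The paper's actual reward shape is not given here."""
    return (1.0 if correct else 0.0) - cost_weight * num_world_model_calls


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each reward by the mean and
    standard deviation of its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four illustrative rollouts for one question: (answer correct?, world-model calls).
rollouts = [(True, 0), (True, 3), (False, 2), (False, 0)]
rewards = [reward(correct, calls) for correct, calls in rollouts]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, the rollout that answers correctly without calling the world model receives the largest advantage, and the rollout that imagines and still answers incorrectly receives the smallest. That is the kind of pressure toward selective imagination the abstract describes.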

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
cs.CV 2026-05 unverdicted novelty 7.0

Frontier VLMs overconfidently answer spatial questions under occlusion (~30% accuracy) and perspective ambiguity (<10% accuracy) instead of abstaining, and often fail to select helpful additional views.

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

View Dropout forces reliance on intermediate thinking images in unified multimodal models, with panoramic renderings proving most effective for out-of-domain cross-view spatial reasoning.

WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models
cs.AI 2026-04 unverdicted novelty 7.0

WorldMAP bootstraps reliable trajectory prediction in vision-language navigation by converting world-model-generated futures into structured supervision, cutting ADE by 18% and FDE by 42.1% on Target-Bench while makin...

Einstein World Models
cs.AI 2026-06 unverdicted novelty 5.0

Einstein World Models integrate visual rollouts from a callable world-module into LLM reasoning traces to support complex thought beyond language.