arxiv: 2511.02360 · v4 · pith:55TB23NOnew · submitted 2025-11-04 · 💻 cs.CV · cs.CL

LaRe: Latent Refocusing for Multimodal Reasoning

keywords: refocusing, latent, reasoning, LaRe, multimodal, visual, paradigm, achieves

Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The prevailing Thinking with Images paradigm achieves visual refocusing by explicitly cropping image regions, but incurs rapidly growing computational overhead. The emerging line of latent-space reasoning reduces token consumption, but lacks the capacity for dynamic refocusing. We argue that this trade-off stems from a tacitly accepted premise: that effective visual refocusing must occur in the form of explicit tokens. Challenging this premise, we propose Latent Refocusing (LaRe), a new multimodal reasoning paradigm in which visual refocusing takes place entirely within the latent space. We further design a semantic augmentation training strategy that preserves the semantic structure of the latent space through a visual reconstruction objective. Experimental evaluations demonstrate that LaRe improves average accuracy by 7.6% compared to existing baselines while reducing the number of tokens required for inference by 59.7%. When scaled to an 8B-parameter Vision-Language Model backbone, LaRe achieves performance comparable to state-of-the-art methods, demonstrating the efficacy of the proposed latent refocusing paradigm for multimodal reasoning.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

    cs.CL 2026-01 unverdicted novelty 7.0

    Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.

  2. Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.

  3. The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

    cs.CV 2026-04 unverdicted novelty 6.0

    A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.