{"work":{"id":"d35f1e96-5f12-4e84-990b-e4b05852180e","openalex_id":null,"doi":null,"arxiv_id":"2506.17218","raw_key":null,"title":"Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens","authors":null,"authors_text":"Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan","year":2025,"venue":"cs.CV","abstract":"Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability. Inspired by the way humans reason with mental imagery-the internal construction and manipulation of visual cues-we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed as Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to ``think visually'', it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. Begin by supervising the latent tokens through distillation from ground-truth image embeddings, we then switch to text-only supervision to make the latent trajectory align tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.","external_url":"https://arxiv.org/abs/2506.17218","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-21T21:44:22.813497+00:00","pith_arxiv_id":"2506.17218","created_at":"2026-05-09T23:44:44.368714+00:00","updated_at":"2026-05-21T21:44:22.813497+00:00","title_quality_ok":true,"display_title":"Machine mental imagery: Empower multimodal reasoning with latent visual tokens","render_title":"Machine mental imagery: Empower multimodal reasoning with latent visual tokens"},"hub":{"state":{"work_id":"d35f1e96-5f12-4e84-990b-e4b05852180e","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":20,"external_cited_by_count":null,"distinct_field_count":5,"first_pith_cited_at":"2025-09-27T04:36:12+00:00","last_pith_cited_at":"2026-05-19T04:29:33+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-27T12:37:43.234339+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":10}],"polarity_counts":[{"context_polarity":"background","n":10}],"runs":{},"summary":{},"graph":{},"authors":[]}}