arXiv preprint arXiv:2411.18142

Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models · arXiv 2411.18142

1 Pith paper cites this work. Polarity classification is still indexing.

1 Pith paper citing it

representative citing papers

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

SceneDiver introduces a coarse-to-fine focus plan generation approach for VLMs that constructs holistic scene graphs and then iteratively decomposes tasks, together with a distillation adapter for VLAs, to reduce visual hallucinations on embodied AI benchmarks.

citing papers explorer

Showing 1 of 1 citing paper.

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

cs.CV · 2026-06-02 · unverdicted · none · ref 37
