Timage generates text query overlays on images via Constrained Schrödinger Bridge to boost fine-grained spatial reasoning in vision-language models, outperforming larger systems on VMCBench with a 7B backbone.
arXiv preprint arXiv:2508.01678 , year=
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
years
2026 3roles
background 1polarities
background 1representative citing papers
HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.
citing papers explorer
-
Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models
Timage generates text query overlays on images via Constrained Schrödinger Bridge to boost fine-grained spatial reasoning in vision-language models, outperforming larger systems on VMCBench with a 7B backbone.
- Dual-Anchoring: Addressing State Drift in Vision-Language Navigation