Timage generates text query overlays on images via Constrained Schrödinger Bridge to boost fine-grained spatial reasoning in vision-language models, outperforming larger systems on VMCBench with a 7B backbone.
arXiv preprint arXiv:2411.14863 (2024)
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models
Timage generates text query overlays on images via Constrained Schrödinger Bridge to boost fine-grained spatial reasoning in vision-language models, outperforming larger systems on VMCBench with a 7B backbone.