Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.
Language–image consistency augmentation and distillation network for visual grounding.Pattern Recognition, 166:111663
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.