Vision-language models display large performance differences and clear limits in zero-shot country-level geolocalization from ground-view photos, with semantic cues helping coarse guesses but failing on fine details.
Qwen3 technical report
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2years
2026 2representative citing papers
A vision-language grounded framework generates and evaluates synthetic remote sensing data, releasing ARAS400k where augmented training outperforms real-data baselines for segmentation and captioning.
citing papers explorer
-
Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization
Vision-language models display large performance differences and clear limits in zero-shot country-level geolocalization from ground-view photos, with semantic cues helping coarse guesses but failing on fine details.
-
Grounding Synthetic Data Generation With Vision and Language Models
A vision-language grounded framework generates and evaluates synthetic remote sensing data, releasing ARAS400k where augmented training outperforms real-data baselines for segmentation and captioning.