{"paper":{"title":"GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding","license":"http://creativecommons.org/licenses/by/4.0/","headline":"GeoVista builds a global exploration plan then performs branch-wise inspections while tracking evidence to interpret ultra-high-resolution remote sensing images.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bo Yang, Haoran Liu, Jiasen Hu, Jiashun Zhu, Lang Sun, Nachuan Xing, Ronghao Fu, Weijie Zhang, Weipeng Zhang, Xiao Yang, Xu Na, Zhiheng Xue, Zhiwen Lin","submitted_at":"2026-05-14T07:15:46Z","abstract_excerpt":"Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. 
To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensi"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that building a global exploration plan followed by branch-wise local inspection with explicit evidence state maintenance will reliably handle sparse tiny evidence across large scenes without losing context or causing duplication, which depends on the effectiveness of the APEX-GRO trajectory corpus and GRPO alignment.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"GeoVista builds a global exploration plan then performs branch-wise inspections while tracking evidence to interpret ultra-high-resolution remote sensing images.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3b34c75ca82ac71e0824dfee0d681e664e129c86ff5f7adcec8e2b298cf9745b"},"source":{"id":"2605.14475","kind":"arxiv","version":1},"verdict":{"id":"4f804091-a583-4336-b194-38c9f79821a8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:36:57.241165Z","strongest_claim":"Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance.","one_line_summary":"GeoVista 
introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that building a global exploration plan followed by branch-wise local inspection with explicit evidence state maintenance will reliably handle sparse tiny evidence across large scenes without losing context or causing duplication, which depends on the effectiveness of the APEX-GRO trajectory corpus and GRPO alignment.","pith_extraction_headline":"GeoVista builds a global exploration plan then performs branch-wise inspections while tracking evidence to interpret ultra-high-resolution remote sensing images."},"references":{"count":94,"sample":[{"doi":"","year":2023,"title":"Towards large-scale small object detection: Survey and benchmarks. IEEE transactions on pattern analysis and machine intelligence, 45(11):13467–13488, 2023","work_id":"56a8f76b-c3bb-4f96-b5de-91e684ceb560","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Star: A first-ever dataset and a large-scale benchmark for scene graph generation in large-size satellite imagery. IEEE Trans","work_id":"a8e1273e-f8dd-4bf9-8265-cd434c2ad827","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"When large vision-language model meets large remote sensing imagery: Coarse-to-fine text-guided token pruning. ArXiv, abs/2503.07588, 2025","work_id":"11ebcf82-8db3-4c59-82f4-8e71cdf69f00","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"Geoeyes: On-demand visual focusing for evidence-grounded understanding of ultra-high-resolution remote sensing 
imagery","work_id":"0d316c52-7eb2-4906-8b3c-d6ff2c4d5135","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"GeoLLaVA-8K: Scaling remote-sensing multimodal large language models to 8K resolution","work_id":"e8b770f4-ab0f-4fa1-9b6e-f9bd0e4e8960","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":94,"snapshot_sha256":"b1a713a072d529affbb993cbb88539381ca1615e03cbe34a46fcd7619837daa0","internal_anchors":15},"formal_canon":{"evidence_count":2,"snapshot_sha256":"ce2454ff5b095b8320453b2c5f64a18930c18de3f06d660902bc1bea73f48e4f"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}