MLLMs that have not been fine-tuned outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves state-of-the-art performance.
Geopixel: Pixel grounding large multimodal model in remote sensing
10 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (10)
roles: baseline (1)
polarities: baseline (1)
representative citing papers
The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.
GeoMeld provides a large-scale aligned multimodal remote sensing dataset with verified semantic captions and a joint pretraining method that improves downstream transfer and cross-sensor robustness in foundation models.
RemoteAgent uses RL fine-tuning on VagueEO to align MLLMs for vague EO intent recognition, handling simple tasks internally and routing dense predictions to tools via Model Context Protocol.
MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.
GeoSearcher introduces anchor-centric reasoning supervised fine-tuning and process-faithful group relative policy optimization to improve MLLM-based remote sensing visual grounding.
The work introduces the MTRS task, the MTRefSeg-21K benchmark of 21K image-text-mask triplets, and MTRefSeg-R1, an LVLM baseline that outperforms standard models via two-stage change-aware training.
WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA, and also introduces a new 7,662-class benchmark.
B-GRTO pre-trains a segmentation tool via bootstrapped group relative optimization on GRPO rollouts, yielding substantial gains over plain GRPO on referring segmentation benchmarks.
citing papers explorer
- MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding