Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.
Geopixel: Pixel grounding large multimodal model in remote sensing
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 8roles
baseline 1polarities
baseline 1representative citing papers
The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.
GeoMeld provides a large-scale aligned multimodal remote sensing dataset with verified semantic captions and a joint pretraining method that improves downstream transfer and cross-sensor robustness in foundation models.
RemoteAgent uses RL fine-tuning on VagueEO to align MLLMs for vague EO intent recognition, handling simple tasks internally and routing dense predictions to tools via Model Context Protocol.
MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.
WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA plus a new 7,662-class benchmark.
ProtoFlow stabilizes class prototypes as low-curvature trajectories in a temporal vector field to mitigate forgetting and improve mIoU in class- and domain-incremental remote sensing segmentation.
citing papers explorer
-
Evaluating Remote Sensing Image Captions Beyond Metric Biases
Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.
-
PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.
-
GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing
GeoMeld provides a large-scale aligned multimodal remote sensing dataset with verified semantic captions and a joint pretraining method that improves downstream transfer and cross-sensor robustness in foundation models.
-
RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs
RemoteAgent uses RL fine-tuning on VagueEO to align MLLMs for vague EO intent recognition, handling simple tasks internally and routing dense predictions to tools via Model Context Protocol.
-
MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.
-
WOW-Seg: A Word-free Open World Segmentation Model
WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA plus a new 7,662-class benchmark.
-
ProtoFlow: Mitigating Forgetting in Class-Incremental Remote Sensing Segmentation via Low-Curvature Prototype Flow
ProtoFlow stabilizes class prototypes as low-curvature trajectories in a temporal vector field to mitigate forgetting and improve mIoU in class- and domain-incremental remote sensing segmentation.
- B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation