GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S Khan, Salman Khan · 2025 · arXiv 2501.13925

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline 1

citation-polarity summary

baseline 1

representative citing papers

Evaluating Remote Sensing Image Captions Beyond Metric Biases

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.

PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.

GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

GeoMeld provides a large-scale aligned multimodal remote sensing dataset with verified semantic captions and a joint pretraining method that improves downstream transfer and cross-sensor robustness in foundation models.

RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

RemoteAgent uses RL fine-tuning on VagueEO to align MLLMs for vague EO intent recognition, handling simple tasks internally and routing dense predictions to tools via Model Context Protocol.

MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

cs.CV · 2025-12-19 · conditional · novelty 7.0

MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.

WOW-Seg: A Word-free Open World Segmentation Model

cs.CV · 2026-05-16 · conditional · novelty 6.0

WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA plus a new 7,662-class benchmark.

ProtoFlow: Mitigating Forgetting in Class-Incremental Remote Sensing Segmentation via Low-Curvature Prototype Flow

cs.CV · 2026-04-03 · unverdicted · novelty 5.0 · 2 refs

ProtoFlow stabilizes class prototypes as low-curvature trajectories in a temporal vector field to mitigate forgetting and improve mIoU in class- and domain-incremental remote sensing segmentation.

B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation

cs.CV · 2026-05-22

citing papers explorer

Showing 8 of 8 citing papers.

Evaluating Remote Sensing Image Captions Beyond Metric Biases
cs.CV · 2026-04-22 · unverdicted · none · ref 45
PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
cs.CV · 2026-04-17 · unverdicted · none · ref 38
GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing
cs.CV · 2026-04-12 · unverdicted · none · ref 18
RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs
cs.CV · 2026-04-09 · unverdicted · none · ref 46
MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
cs.CV · 2025-12-19 · conditional · none · ref 71
WOW-Seg: A Word-free Open World Segmentation Model
cs.CV · 2026-05-16 · conditional · none · ref 18
ProtoFlow: Mitigating Forgetting in Class-Incremental Remote Sensing Segmentation via Low-Curvature Prototype Flow
cs.CV · 2026-04-03 · unverdicted · none · ref 5 · 2 links
B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation
cs.CV · 2026-05-22 · unreviewed · ref 51
