pith. machine review for the scientific record.

arxiv: 2605.10769 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: no theorem link

MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:13 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords remote sensing · semantic segmentation · multimodal large language models · mix of experts · scene captioning · visual feature guidance · dynamic integration · land cover mapping

The pith

Dynamic mixing of captions from multiple MLLMs guides visual features for more accurate remote sensing scene segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MPerS, a method that generates multiple scene captions for remote sensing images by applying varied prompts to several MLLMs such as LLaVA, ChatGPT, and Qwen. These captions supply textual semantics that a Dynamic MixExperts module selects and combines on the fly. The selected text then directs visual features extracted by DINOv3 through Linguistic Query Guided Attention to produce pixel-level land-cover maps. The authors report that this multimodal approach outperforms prior methods on three standard remote sensing segmentation benchmarks. A sympathetic reader would care because remote sensing images often contain complex, ambiguous land-cover patterns where pure visual models struggle, and reliable text guidance could reduce the need for extensive labeled data.
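To make that data flow concrete, here is a minimal PyTorch sketch of the pipeline as described above. Every name, shape, and module internal is an illustrative assumption; the paper's released code and exact dimensions are not given in the text Pith quotes.

    import torch
    import torch.nn as nn

    # Hypothetical shapes: B images, N candidate captions per image (one per
    # MLLM x prompt), D shared embedding width, HW flattened feature grid.
    B, N, D, HW, NUM_CLASSES = 2, 6, 256, 1024, 6

    # Stand-ins for the real components: a frozen DINOv3 backbone would yield
    # dense patch features; a text encoder would embed each MLLM caption.
    visual_feats = torch.randn(B, HW, D)      # DINOv3-style dense features
    caption_embs = torch.randn(B, N, D)       # one embedding per caption

    # 1) Dynamic MixExperts (sketched): a learned gate scores each caption
    #    embedding and mixes them into one text query (softmax-weighted sum).
    gate = nn.Linear(D, 1)
    weights = torch.softmax(gate(caption_embs).squeeze(-1), dim=-1)  # (B, N)
    text_query = torch.einsum("bn,bnd->bd", weights, caption_embs)   # (B, D)

    # 2) Linguistic Query Guided Attention (sketched): the mixed text query
    #    is attended against the visual features, which are then refined.
    attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
    guided, _ = attn(query=visual_feats,
                     key=text_query.unsqueeze(1),
                     value=text_query.unsqueeze(1))
    fused = visual_feats + guided             # residual text guidance

    # 3) Pixel-level prediction from the guided features.
    head = nn.Linear(D, NUM_CLASSES)
    logits = head(fused)                      # (B, HW, NUM_CLASSES)
    print(logits.shape)                       # torch.Size([2, 1024, 6])

The point of the sketch is only the ordering: captions are embedded, gated into a single query, and that query reshapes the dense visual features before the segmentation head.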

Core claim

We design MPerS to let MLLMs perceive remote sensing scenes from diverse expert perspectives by generating high-quality captions with multiple prompts, employ DINOv3 for dense visual representations of land-covers, introduce a Dynamic MixExperts module that adaptively integrates the most effective textual semantics, and construct Linguistic Query Guided Attention to let the textual information guide visual features for precise segmentation, achieving superior performance on three public semantic segmentation RS datasets.

What carries the argument

The Dynamic MixExperts module that adaptively integrates the most effective textual semantics from MLLM captions, paired with Linguistic Query Guided Attention that uses those semantics to guide DINOv3 visual features.
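The quoted text does not spell out how the gate works. One plausible realization, sketched below under that assumption, conditions the expert weights on the pooled visual features so that which caption "wins" can change per scene; the class name is borrowed from the paper, but its internals are guessed.

    import torch
    import torch.nn as nn

    class DynamicMixExperts(nn.Module):
        """Assumed realization, not the authors' implementation: the gate
        sees both the pooled visual features and each caption embedding,
        so the mixture adapts per image."""

        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1)
            )

        def forward(self, visual_feats, caption_embs):
            # visual_feats: (B, HW, D); caption_embs: (B, N, D)
            scene = visual_feats.mean(dim=1, keepdim=True)        # (B, 1, D)
            scene = scene.expand(-1, caption_embs.size(1), -1)    # (B, N, D)
            pair = torch.cat([scene, caption_embs], dim=-1)       # (B, N, 2D)
            w = torch.softmax(self.score(pair).squeeze(-1), -1)   # (B, N)
            mixed = torch.einsum("bn,bnd->bd", w, caption_embs)   # (B, D)
            return mixed, w

    mixer = DynamicMixExperts(dim=256)
    mixed, w = mixer(torch.randn(2, 1024, 256), torch.randn(2, 6, 256))
    print(mixed.shape, w.shape)  # torch.Size([2, 256]) torch.Size([2, 6])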

If this is right

  • Textual semantics from multiple MLLM perspectives can be fused adaptively to improve segmentation accuracy in complex remote sensing scenes.
  • Linguistic Query Guided Attention allows caption information to directly refine visual feature maps for land-cover boundaries (a sketch of this mechanism follows this list).
  • Superior results on three public RS segmentation datasets follow from the combination of diverse caption generation and dynamic expert selection.
  • The method reduces reliance on purely visual models by injecting scene-level textual understanding.
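As a hedged illustration of the second bullet, the sketch below treats the mixed caption embedding as an attention query over visual patches and amplifies the patches it points at. The paper names the module Linguistic Query Guided Attention; the arithmetic here is generic text-as-query cross-attention, not the authors' design.

    import torch
    import torch.nn as nn

    class LinguisticQueryGuidedAttention(nn.Module):
        """Hedged sketch: the caption embedding acts as the query, visual
        patches as keys, and the resulting relevance map re-weights each
        patch. The paper's exact design may differ."""

        def __init__(self, dim: int):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, text_query, visual_feats):
            # text_query: (B, D); visual_feats: (B, HW, D)
            q = self.q(text_query).unsqueeze(1)                   # (B, 1, D)
            k = self.k(visual_feats)                              # (B, HW, D)
            sim = (q @ k.transpose(1, 2)) * self.scale            # (B, 1, HW)
            relevance = sim.softmax(dim=-1).transpose(1, 2)       # (B, HW, 1)
            # Patches the caption "points at" are amplified before decoding.
            return visual_feats * (1.0 + relevance)

    lqga = LinguisticQueryGuidedAttention(256)
    out = lqga(torch.randn(2, 256), torch.randn(2, 1024, 256))
    print(out.shape)  # torch.Size([2, 1024, 256])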

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If future MLLMs produce even higher-quality or domain-specific RS captions, the same Dynamic MixExperts structure could yield further gains without changing the visual backbone.
  • The approach may extend to other dense prediction tasks such as change detection or instance segmentation in aerial imagery by reusing the caption-to-feature guidance pathway.
  • Failure modes would likely appear first in scenes where all MLLMs generate similar but incorrect descriptions, limiting the benefit of the mixing step.

Load-bearing premise

The captions produced by the chosen MLLMs through multiple prompts are consistently high-quality and relevant enough that the Dynamic MixExperts module can reliably pick and fuse the best ones for guiding segmentation.

What would settle it

Running the full MPerS pipeline on a remote sensing dataset where the MLLM captions contain systematic factual errors or hallucinations and measuring whether the reported performance gains over baselines disappear.
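A minimal harness for that experiment might look like the following. Here run_mpers is a hypothetical callable standing in for the full pipeline, and the corruption scheme (swapping captions between images) is one simple way to make every caption systematically wrong for its scene.

    import numpy as np

    def miou(pred, gt, num_classes):
        """Mean intersection-over-union for integer label maps."""
        ious = []
        for c in range(num_classes):
            inter = np.logical_and(pred == c, gt == c).sum()
            union = np.logical_or(pred == c, gt == c).sum()
            if union > 0:
                ious.append(inter / union)
        return float(np.mean(ious))

    def corrupt(captions, rng):
        """Toy corruption: shuffle captions across images so each image is
        paired with a caption describing some other scene."""
        return [captions[i] for i in rng.permutation(len(captions))]

    def stress_test(run_mpers, images, captions, gts, num_classes, seed=0):
        # run_mpers(image, caption) -> predicted label map; assumed, not
        # provided by the paper.
        rng = np.random.default_rng(seed)
        clean = [miou(run_mpers(im, cap), gt, num_classes)
                 for im, cap, gt in zip(images, captions, gts)]
        bad_caps = corrupt(captions, rng)
        broken = [miou(run_mpers(im, cap), gt, num_classes)
                  for im, cap, gt in zip(images, bad_caps, gts)]
        # If the gains vanish here, the text pathway was doing real work.
        return np.mean(clean) - np.mean(broken)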

Figures

Figures reproduced from arXiv: 2605.10769 by Hongyang Zhang, Man On Pun, Xianping Ma, Ziyao Wang, Ziyi Wang.

Figure 1: Workflow of MPerS. Simple prompts may generate inappropriate captions, leading to erroneous perceptual understanding. …
Figure 2: The pipeline for effective semantic text acquisition …
Figure 3: The framework of MPerS, which encompasses four units: vision encoder, Dynamic MLLM MixExperts extract effective textual …
Figure 4: Architecture of the Linguistic Query Guided Attention.
Figure 5: Qualitative visual comparison with state-of-the-art methods on the Vaihingen dataset. Dashed bounding boxes indicate regions …
read the original abstract

The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. However, when dealing with complex remote sensing (RS) scenes, existing studies have predominantly concentrated on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting the generation of high-quality RS captions and the investigation of their effectiveness in multimodal semantic fusion. In this context, we propose the Dynamic MLLM Mixture-of-Experts Perception-Guided Remote Sensing Scene Segmentation, referred to as MPerS. We design multiple prompts for MLLMs to generate high-quality RS captions, enabling MLLMs to perceive RS scenes from diverse expert perspectives. DINOv3 is employed to extract dense visual representations of land-covers. We design a Dynamic MixExperts module that adaptively integrates the most effective textual semantics. Linguistic Query Guided Attention is constructed to utilize textual semantic information to guide visual features for precise segmentation. The MLLMs include LLaVA, ChatGPT, and Qwen. Our method achieves superior performance on three public semantic segmentation RS datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MPerS, a multimodal method for remote sensing scene segmentation that generates RS-specific captions using multiple prompts with MLLMs (LLaVA, ChatGPT, Qwen), extracts visual features via DINOv3, adaptively fuses the most effective textual semantics through a Dynamic MixExperts module, and applies Linguistic Query Guided Attention to guide segmentation. It claims superior performance over baselines on three public RS semantic segmentation datasets.

Significance. If the superiority claim holds with proper validation, the work could meaningfully advance multimodal RS segmentation by shifting focus from pure architectural fusion to caption quality and adaptive expert selection. The introduction of Dynamic MixExperts and Linguistic Query Guided Attention offers a novel way to handle diverse textual semantics, which may generalize to other vision-language tasks in remote sensing where general-purpose MLLMs are applied to domain-specific imagery.

major comments (3)
  1. [Abstract] The assertion of 'superior performance' on three datasets is unsupported by any quantitative metrics, ablation results, statistical tests, or error analysis; this gap is load-bearing for the central empirical claim and prevents assessment of whether the gains exceed standard variance.
  2. [Method] The performance attribution for the Dynamic MixExperts module and Linguistic Query Guided Attention depends on MLLM captions reliably encoding RS-specific land-cover details rather than generic or hallucinatory content, yet no quantitative caption evaluation (human ratings, label alignment scores, or an ablation removing text guidance) is reported to substantiate this weakest assumption.
  3. [Experiments] Without ablations isolating the contribution of caption selection from the DINOv3 backbone alone, or comparisons across the three MLLMs, it is unclear whether the proposed modules drive the claimed gains or whether the results reduce to the visual backbone plus a standard segmentation head.
minor comments (2)
  1. [Abstract] The abstract is overly dense; separating the problem statement, proposed components, and results into distinct sentences would improve readability.
  2. [Method] Notation for 'Dynamic MixExperts module' and 'Linguistic Query Guided Attention' is introduced without cross-references to equations or figures defining their inputs/outputs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional evidence and clarity will strengthen the manuscript. We address each major comment below and will incorporate the suggested revisions to better support our empirical claims.

read point-by-point responses
  1. Referee: [Abstract] The assertion of 'superior performance' on three datasets is unsupported by any quantitative metrics, ablation results, statistical tests, or error analysis; this gap is load-bearing for the central empirical claim and prevents assessment of whether the gains exceed standard variance.

    Authors: We agree that the abstract should provide concrete quantitative support. In the revised manuscript, we will update the abstract to include specific mIoU (and other metric) values on the three datasets along with the observed improvements over the strongest baselines. We will also add a brief reference to the ablation studies and any statistical significance testing already performed in the experiments section. revision: yes

  2. Referee: [Method] The performance attribution for the Dynamic MixExperts module and Linguistic Query Guided Attention depends on MLLM captions reliably encoding RS-specific land-cover details rather than generic or hallucinatory content, yet no quantitative caption evaluation (human ratings, label alignment scores, or an ablation removing text guidance) is reported to substantiate this weakest assumption.

    Authors: This comment correctly identifies a missing validation step. While the method describes the multi-prompt strategy for generating RS-specific captions, we did not report direct quality metrics. We will add a dedicated evaluation subsection (or appendix) containing human ratings of caption relevance to land-cover classes on a sampled subset, plus an ablation that removes textual guidance entirely to quantify its contribution to final segmentation accuracy. revision: yes

  3. Referee: [Experiments] Without ablations isolating the contribution of caption selection from the DINOv3 backbone alone, or comparisons across the three MLLMs, it is unclear whether the proposed modules drive the claimed gains or whether the results reduce to the visual backbone plus a standard segmentation head.

    Authors: We acknowledge the need for more granular ablations. The current experiments compare against external baselines, but we will expand the ablation studies in the revised version to explicitly include: (1) DINOv3 features with a standard segmentation head only, (2) separate results for each of the three MLLMs (LLaVA, ChatGPT, Qwen) versus the Dynamic MixExperts combination, and (3) variants with and without the Linguistic Query Guided Attention module. These will be presented in additional tables to isolate the contribution of each proposed component. revision: yes
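For concreteness, the ablation grid this rebuttal promises could be enumerated as below. The configuration keys and source names are illustrative, not the authors' actual settings.

    from itertools import product

    # Hypothetical ablation grid mirroring the rebuttal's three axes.
    TEXT_SOURCES = ["none", "llava", "chatgpt", "qwen", "mixexperts"]
    GUIDANCE = ["no_lqga", "lqga"]

    def ablation_runs():
        for text, guide in product(TEXT_SOURCES, GUIDANCE):
            if text == "none" and guide == "lqga":
                continue  # no captions means nothing for LQGA to query
            # ("none", "no_lqga") is the DINOv3-backbone-only baseline the
            # referee asked for; every other cell isolates one component.
            yield {"text_source": text, "guidance": guide}

    for cfg in ablation_runs():
        print(cfg)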

Circularity Check

0 steps flagged

No circularity: empirical architecture with independent performance claims

full rationale

The paper presents an empirical method for remote sensing segmentation that integrates MLLM-generated captions (from LLaVA, ChatGPT, Qwen via multiple prompts), DINOv3 visual features, a Dynamic MixExperts module, and Linguistic Query Guided Attention. The central claim of superior performance rests on reported results across three public datasets rather than any derivation chain, equations, or first-principles reduction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The architecture is described as a novel combination of external components without tautological equivalence to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach rests on the unverified assumption that MLLM-generated captions provide reliable semantic guidance for RS scenes and that the proposed modules function as intended; no independent evidence for these is given in the abstract.

axioms (1)
  • domain assumption: Multiple expert prompts enable MLLMs to produce high-quality, diverse RS scene captions that improve multimodal fusion.
    Invoked in the design of prompts for LLaVA, ChatGPT, and Qwen.
invented entities (2)
  • Dynamic MixExperts module (no independent evidence)
    purpose: Adaptively selects and integrates the most effective textual semantics from different MLLMs.
    New module introduced to handle varying caption quality across scenes.
  • Linguistic Query Guided Attention (no independent evidence)
    purpose: Uses textual semantics to guide and refine visual features for segmentation.
    Constructed specifically for this perception-guided pipeline.

pith-pipeline@v0.9.0 · 5492 in / 1366 out tokens · 58720 ms · 2026-05-12T03:13:04.587911+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 1 internal anchor

  1. [1]

    Algorithms for semantic segmentation of multispectral re- mote sensing imagery using deep learning.ISPRS Journal of Photogrammetry and Remote Sensing, 2018. 2

  2. [2]

    In- stancecap: Improving text-to-video generation via instance- aware structured caption

    Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, and Ying Tai. In- stancecap: Improving text-to-video generation via instance- aware structured caption. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28974– 28983, 2025. 4

  3. [3]

    Liang Gao, Hui Liu, Minhang Yang, Long Chen, Yaling Wan, Zhengqing Xiao, and Yurong Qian. Stransfuse: Fus- ing swin transformer and convolutional neural network for remote sensing image semantic segmentation.IEEE journal of selected topics in applied earth observations and remote sensing, 14:10990–11003, 2021. 2

  4. [4]

    Dinomaly: The less is more philosophy in multi-class unsupervised anomaly detection

    Jia Guo, Shuai Lu, Weihang Zhang, Fang Chen, Huiqi Li, and Hongen Liao. Dinomaly: The less is more philosophy in multi-class unsupervised anomaly detection. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 20405–20415, 2025. 3

  5. [5]

    Swin transformer embedding unet for remote sensing image semantic segmentation.IEEE transactions on geoscience and remote sensing, 60:1–15, 2022

    Xin He, Yong Zhou, Jiaqi Zhao, Di Zhang, Rui Yao, and Yong Xue. Swin transformer embedding unet for remote sensing image semantic segmentation.IEEE transactions on geoscience and remote sensing, 60:1–15, 2022. 2

  6. [6]

    Ringmo-agent: A unified remote sensing foun- dation model for multi-platform and multi-modal reasoning.arXiv preprint arXiv:2507.20776, 2025

    Huiyang Hu, Peijin Wang, Yingchao Feng, Kaiwen Wei, Wenxin Yin, Wenhui Diao, Mengyu Wang, Hanbo Bi, Kaiyue Kang, Tong Ling, et al. Ringmo-agent: A unified re- mote sensing foundation model for multi-platform and multi- modal reasoning.arXiv preprint arXiv:2507.20776, 2025. 3

  7. [7]

    A2-fpn: Attention aggregation based feature pyramid network for in- stance segmentation

    Miao Hu, Yali Li, Lu Fang, and Shengjin Wang. A2-fpn: Attention aggregation based feature pyramid network for in- stance segmentation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 15343–15352, 2021. 6

  8. [8]

    Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

    Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025. 3

  9. [9]

    Wenlan: Bridging vision and language by large-scale multi-modal pre-training

    Y Huo, M Zhang, G Liu, H Lu, Y Gao, G Yang, J Wen, H Zhang, B Xu, W Zheng, et al. Wenlan: Bridging vision and language by large-scale multi-modal pre-training. arxiv (2021).arXiv preprint arXiv:2103.06561, 2021. 3

  10. [10]

    Dilateformer: Multi- scale dilated transformer for visual recognition.IEEE trans- actions on multimedia, 25:8906–8919, 2023

    Jiayu Jiao, Yu-Ming Tang, Kun-Yu Lin, Yipeng Gao, Andy J Ma, Yaowei Wang, and Wei-Shi Zheng. Dilateformer: Multi- scale dilated transformer for visual recognition.IEEE trans- actions on multimedia, 25:8906–8919, 2023. 4

  11. [11]

    Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment

    Cijo Jose, Th ´eo Moutakanni, Dahyun Kang, Federico Baldassarre, Timoth ´ee Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Micha ¨el Ramamonjisoa, Maxime Oquab, et al. Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24905–24916, 2025. 3

  12. [12]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4015–4026, 2023. 1

  13. [13]

    Hybrid-scale self-similarity ex- ploitation for remote sensing image super-resolution.IEEE Transactions on Geoscience and Remote Sensing, 60:1–10,

    Sen Lei and Zhenwei Shi. Hybrid-scale self-similarity ex- ploitation for remote sensing image super-resolution.IEEE Transactions on Geoscience and Remote Sensing, 60:1–10,

  14. [14]

    Haifeng Li, Kaijian Qiu, Li Chen, Xiaoming Mei, Liang Hong, and Chao Tao. Scattnet: Semantic segmentation net- work with spatial and channel attention mechanism for high- resolution remote sensing images.IEEE Geoscience and Re- mote Sensing Letters, 18(5):905–909, 2020. 2

  15. [15]

    Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021. 3

  16. [16]

    Multistage attention resu-net for semantic segmen- tation of fine-resolution remote sensing images.IEEE Geo- science and Remote Sensing Letters, 19:1–5, 2021

    Rui Li, Shunyi Zheng, Chenxi Duan, Jianlin Su, and Ce Zhang. Multistage attention resu-net for semantic segmen- tation of fine-resolution remote sensing images.IEEE Geo- science and Remote Sensing Letters, 19:1–5, 2021. 2, 6

  17. [17]

    Rui Li, Shunyi Zheng, Ce Zhang, Chenxi Duan, Libo Wang, and Peter M Atkinson. Abcnet: Attentive bilateral con- textual network for efficient semantic segmentation of fine- resolution remotely sensed imagery.ISPRS journal of pho- togrammetry and remote sensing, 181:84–98, 2021. 2

  18. [18]

    Dynamic updates for language adaptation in visual-language tracking

    Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, and Shuxiang Song. Dynamic updates for language adaptation in visual-language tracking. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19165–19174, 2025. 2

  19. [19]

    Sardet-100k: Towards open- source benchmark and toolkit for large-scale sar object de- tection.Advances in Neural Information Processing Systems, 37:128430–128461, 2024

    Yuxuan Li, Xiang Li, Weijie Li, Qibin Hou, Li Liu, Ming- Ming Cheng, and Jian Yang. Sardet-100k: Towards open- source benchmark and toolkit for large-scale sar object de- tection.Advances in Neural Information Processing Systems, 37:128430–128461, 2024. 2

  20. [20]

    Re- moteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

    Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Re- moteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024. 3

  21. [21]

    Esms-net: Enhancing semantic-mask segmenta- tion network with pyramid atrousformer for remote sensing image.IEEE Transactions on Geoscience and Remote Sens- ing, 2024

    Jiamin Liu, Ziyi Wang, Fulin Luo, Tan Guo, Feng Yang, and Xinbo Gao. Esms-net: Enhancing semantic-mask segmenta- tion network with pyramid atrousformer for remote sensing image.IEEE Transactions on Geoscience and Remote Sens- ing, 2024. 2

  22. [22]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015. 2

  23. [23]

    Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019. 3

  24. [24]

    Sam-assisted remote sensing imagery semantic segmentation with object and boundary constraints.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

    Xianping Ma, Qianqian Wu, Xingyu Zhao, Xiaokang Zhang, Man-On Pun, and Bo Huang. Sam-assisted remote sensing imagery semantic segmentation with object and boundary constraints.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024. 2

  25. [25]

    Rs 3 mamba: Visual state space model for remote sensing image semantic segmentation.IEEE Geoscience and Remote Sens- ing Letters, 21:1–5, 2024

    Xianping Ma, Xiaokang Zhang, and Man-On Pun. Rs 3 mamba: Visual state space model for remote sensing image semantic segmentation.IEEE Geoscience and Remote Sens- ing Letters, 21:1–5, 2024. 6

  26. [26]

    A unified framework with multimodal fine-tuning for remote sensing semantic segmentation.IEEE Transac- tions on Geoscience and Remote Sensing, 63:1–15, 2025

    Xianping Ma, Xiaokang Zhang, Man-On Pun, and Bo Huang. A unified framework with multimodal fine-tuning for remote sensing semantic segmentation.IEEE Transac- tions on Geoscience and Remote Sensing, 63:1–15, 2025. 2

  27. [27]

    Cross-entropy loss functions: Theoretical analysis and applications

    Anqi Mao, Mehryar Mohri, and Yutao Zhong. Cross-entropy loss functions: Theoretical analysis and applications. InIn- ternational conference on Machine learning, pages 23803– 23828, 2023. 5

  28. [28]

    Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022. 3

  29. [29]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 3

  30. [30]

    Denseclip: Language-guided dense prediction with context- aware prompting

    Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context- aware prompting. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 18082–18091, 2022. 3

  31. [31]

    Syndrone-multi-modal uav dataset for ur- ban scenarios

    Giulia Rizzoli, Francesco Barbato, Matteo Caligiuri, and Pietro Zanuttigh. Syndrone-multi-modal uav dataset for ur- ban scenarios. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 2210–2220,

  32. [32]

    U- net: Convolutional networks for biomedical image segmen- tation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InMedical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241, 2015. 2

  33. [33]

    DINOv3

    Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 3

  34. [34]

    Ctmfnet: Cnn and transformer multiscale fusion net- work of remote sensing urban scene imagery.IEEE Trans- actions on Geoscience and Remote Sensing, 61:1–14, 2022

    Pengfei Song, Jinjiang Li, Zhiyong An, Hui Fan, and Linwei Fan. Ctmfnet: Cnn and transformer multiscale fusion net- work of remote sensing urban scene imagery.IEEE Trans- actions on Geoscience and Remote Sensing, 61:1–14, 2022. 2

  35. [35]

    2d semantic labeling dataset.Accessed: Apr., 2018

    ISPRS Vaihingen. 2d semantic labeling dataset.Accessed: Apr., 2018. 2

  36. [36]

    Scale-aware neural network for semantic segmentation of multi-resolution remote sens- ing images.Remote sensing, 13(24):5015, 2021

    Libo Wang, Ce Zhang, Rui Li, Chenxi Duan, Xiaoliang Meng, and Peter M Atkinson. Scale-aware neural network for semantic segmentation of multi-resolution remote sens- ing images.Remote sensing, 13(24):5015, 2021. 2

  37. [37]

    A novel transformer based se- mantic segmentation scheme for fine-resolution remote sens- ing images.IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2022

    Libo Wang, Rui Li, Chenxi Duan, Ce Zhang, Xiaoliang Meng, and Shenghui Fang. A novel transformer based se- mantic segmentation scheme for fine-resolution remote sens- ing images.IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2022. 6

  38. [38]

    Libo Wang, Rui Li, Ce Zhang, Shenghui Fang, Chenxi Duan, Xiaoliang Meng, and Peter M Atkinson. Unetformer: A unet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery.ISPRS Journal of Pho- togrammetry and Remote Sensing, 190:196–214, 2022. 2, 5, 6

  39. [39]

    Libo Wang, Sijun Dong, Ying Chen, Xiaoliang Meng, Shenghui Fang, and Songlin Fei. Metasegnet: Metadata- collaborative vision-language representation learning for se- mantic segmentation of remote sensing images.IEEE Trans- actions on Geoscience and Remote Sensing, 2024. 2, 6

  40. [40]

    isaid: A large-scale dataset for instance segmentation in aerial images

    Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. isaid: A large-scale dataset for instance segmentation in aerial images. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 28–37, 2019. 2, 5

  41. [41]

    Fsvlm: A vision-language model for remote sensing farmland segmentation.IEEE Transactions on Geo- science and Remote Sensing, 2025

    Haiyang Wu, Zhuofei Du, Dandan Zhong, Yuze Wang, and Chao Tao. Fsvlm: A vision-language model for remote sensing farmland segmentation.IEEE Transactions on Geo- science and Remote Sensing, 2025. 3

  42. [42]

    Groupvit: Semantic segmentation emerges from text supervision

    Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18134–18144, 2022. 3

  43. [43]

    Bootstrapping interactive image–text alignment for remote sensing image captioning.IEEE Transactions on Geoscience and Remote Sensing, 62:1–12, 2024

    Cong Yang, Zuchao Li, and Lefei Zhang. Bootstrapping interactive image–text alignment for remote sensing image captioning.IEEE Transactions on Geoscience and Remote Sensing, 62:1–12, 2024. 3

  44. [44]

    A joint-training two-stage method for remote sensing image captioning.IEEE Transactions on Geoscience and Remote Sensing, 60:1–16, 2022

    Xiutiao Ye, Shuang Wang, Yu Gu, Jihui Wang, Ruixuan Wang, Biao Hou, Fausto Giunchiglia, and Licheng Jiao. A joint-training two-stage method for remote sensing image captioning.IEEE Transactions on Geoscience and Remote Sensing, 60:1–16, 2022. 2

  45. [45]

    arXiv preprint arXiv:2501.04001 , year=

    Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos.arXiv preprint arXiv:2501.04001, 2025. 4

  46. [46]

    Cheng Zhang, Wanshou Jiang, Yuan Zhang, Wei Wang, Qing Zhao, and Chenjie Wang. Transformer and cnn hy- brid deep neural network for semantic segmentation of very- high-resolution remote sensing imagery.IEEE Transactions on Geoscience and Remote Sensing, 60:1–20, 2022. 2

  47. [47]

    Segclip: Multimodal visual-language and prompt learning for high-resolution remote sensing se- mantic segmentation.IEEE Transactions on Geoscience and Remote Sensing, 2024

    Shijie Zhang, Bin Zhang, Yuntao Wu, Huabing Zhou, Junjun Jiang, and Jiayi Ma. Segclip: Multimodal visual-language and prompt learning for high-resolution remote sensing se- mantic segmentation.IEEE Transactions on Geoscience and Remote Sensing, 2024. 3, 6

  48. [48]

    Cooperative connection trans- former for remote sensing image captioning.IEEE Trans- actions on Geoscience and Remote Sensing, 62:1–14, 2024

    Kai Zhao and Wei Xiong. Cooperative connection trans- former for remote sensing image captioning.IEEE Trans- actions on Geoscience and Remote Sensing, 62:1–14, 2024. 3

  49. [49]

    High-resolution remote sensing image captioning based on structured atten- tion.IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2021

    Rui Zhao, Zhenwei Shi, and Zhengxia Zou. High-resolution remote sensing image captioning based on structured atten- tion.IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2021. 2

  50. [50]

    Image fusion via vision-language model.arXiv preprint arXiv:2402.02235,

    Zixiang Zhao, Lilun Deng, Haowen Bai, Yukun Cui, Zhipeng Zhang, Yulun Zhang, Haotong Qin, Dongdong Chen, Jiangshe Zhang, Peng Wang, et al. Image fusion via vision-language model.arXiv preprint arXiv:2402.02235,

  51. [51]

    Sega- gent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories

    Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qing- pei Guo, Yang Liu, Ming Yang, and Chunhua Shen. Sega- gent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3686–3696, 2025. 2

  52. [52]

    Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling

    Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 14733–14744, 2025. 2