pith. machine review for the scientific record.

arxiv: 2605.10769 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: no theorem link

MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:13 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords remote sensing · semantic segmentation · multimodal large language models · mix of experts · scene captioning · visual feature guidance · dynamic integration · land cover mapping

The pith

Dynamic mixing of captions from multiple MLLMs guides visual features for more accurate remote sensing scene segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MPerS, a method that generates multiple scene captions for remote sensing images by applying varied prompts to several MLLMs such as LLaVA, ChatGPT, and Qwen. These captions supply textual semantics that a Dynamic MixExperts module selects and combines on the fly. The selected text then directs visual features extracted by DINOv3 through Linguistic Query Guided Attention to produce pixel-level land-cover maps. The authors report that this multimodal approach outperforms prior methods on three standard remote sensing segmentation benchmarks. A sympathetic reader would care because remote sensing images often contain complex, ambiguous land-cover patterns where pure visual models struggle, and reliable text guidance could reduce the need for extensive labeled data.
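To make that data flow concrete, here is a minimal PyTorch sketch of the pipeline as described above. Every name, shape, and module internal is an illustrative assumption; the paper's released code and exact dimensions are not given in the text Pith quotes.

    import torch
    import torch.nn as nn

    # Hypothetical shapes: B images, N candidate captions per image (one per
    # MLLM x prompt), D shared embedding width, HW flattened feature grid.
    B, N, D, HW, NUM_CLASSES = 2, 6, 256, 1024, 6

    # Stand-ins for the real components: a frozen DINOv3 backbone would yield
    # dense patch features; a text encoder would embed each MLLM caption.
    visual_feats = torch.randn(B, HW, D)      # DINOv3-style dense features
    caption_embs = torch.randn(B, N, D)       # one embedding per caption

    # 1) Dynamic MixExperts (sketched): a learned gate scores each caption
    #    embedding and mixes them into one text query (softmax-weighted sum).
    gate = nn.Linear(D, 1)
    weights = torch.softmax(gate(caption_embs).squeeze(-1), dim=-1)  # (B, N)
    text_query = torch.einsum("bn,bnd->bd", weights, caption_embs)   # (B, D)

    # 2) Linguistic Query Guided Attention (sketched): the mixed text query
    #    is attended against the visual features, which are then refined.
    attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
    guided, _ = attn(query=visual_feats,
                     key=text_query.unsqueeze(1),
                     value=text_query.unsqueeze(1))
    fused = visual_feats + guided             # residual text guidance

    # 3) Pixel-level prediction from the guided features.
    head = nn.Linear(D, NUM_CLASSES)
    logits = head(fused)                      # (B, HW, NUM_CLASSES)
    print(logits.shape)                       # torch.Size([2, 1024, 6])

The point of the sketch is only the ordering: captions are embedded, gated into a single query, and that query reshapes the dense visual features before the segmentation head.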

Core claim

We design MPerS to let MLLMs perceive remote sensing scenes from diverse expert perspectives by generating high-quality captions with multiple prompts, employ DINOv3 for dense visual representations of land-covers, introduce a Dynamic MixExperts module that adaptively integrates the most effective textual semantics, and construct Linguistic Query Guided Attention to let the textual information guide visual features for precise segmentation, achieving superior performance on three public semantic segmentation RS datasets.

What carries the argument

The Dynamic MixExperts module that adaptively integrates the most effective textual semantics from MLLM captions, paired with Linguistic Query Guided Attention that uses those semantics to guide DINOv3 visual features.
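The quoted text does not spell out how the gate works. One plausible realization, sketched below under that assumption, conditions the expert weights on the pooled visual features so that which caption "wins" can change per scene; the class name is borrowed from the paper, but its internals are guessed.

    import torch
    import torch.nn as nn

    class DynamicMixExperts(nn.Module):
        """Assumed realization, not the authors' implementation: the gate
        sees both the pooled visual features and each caption embedding,
        so the mixture adapts per image."""

        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1)
            )

        def forward(self, visual_feats, caption_embs):
            # visual_feats: (B, HW, D); caption_embs: (B, N, D)
            scene = visual_feats.mean(dim=1, keepdim=True)        # (B, 1, D)
            scene = scene.expand(-1, caption_embs.size(1), -1)    # (B, N, D)
            pair = torch.cat([scene, caption_embs], dim=-1)       # (B, N, 2D)
            w = torch.softmax(self.score(pair).squeeze(-1), -1)   # (B, N)
            mixed = torch.einsum("bn,bnd->bd", w, caption_embs)   # (B, D)
            return mixed, w

    mixer = DynamicMixExperts(dim=256)
    mixed, w = mixer(torch.randn(2, 1024, 256), torch.randn(2, 6, 256))
    print(mixed.shape, w.shape)  # torch.Size([2, 256]) torch.Size([2, 6])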

If this is right

  • Textual semantics from multiple MLLM perspectives can be fused adaptively to improve segmentation accuracy in complex remote sensing scenes.
  • Linguistic Query Guided Attention allows caption information to directly refine visual feature maps for land-cover boundaries (a sketch of this mechanism follows this list).
  • Superior results on three public RS segmentation datasets follow from the combination of diverse caption generation and dynamic expert selection.
  • The method reduces reliance on purely visual models by injecting scene-level textual understanding.
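As a hedged illustration of the second bullet, the sketch below treats the mixed caption embedding as an attention query over visual patches and amplifies the patches it points at. The paper names the module Linguistic Query Guided Attention; the arithmetic here is generic text-as-query cross-attention, not the authors' design.

    import torch
    import torch.nn as nn

    class LinguisticQueryGuidedAttention(nn.Module):
        """Hedged sketch: the caption embedding acts as the query, visual
        patches as keys, and the resulting relevance map re-weights each
        patch. The paper's exact design may differ."""

        def __init__(self, dim: int):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, text_query, visual_feats):
            # text_query: (B, D); visual_feats: (B, HW, D)
            q = self.q(text_query).unsqueeze(1)                   # (B, 1, D)
            k = self.k(visual_feats)                              # (B, HW, D)
            sim = (q @ k.transpose(1, 2)) * self.scale            # (B, 1, HW)
            relevance = sim.softmax(dim=-1).transpose(1, 2)       # (B, HW, 1)
            # Patches the caption "points at" are amplified before decoding.
            return visual_feats * (1.0 + relevance)

    lqga = LinguisticQueryGuidedAttention(256)
    out = lqga(torch.randn(2, 256), torch.randn(2, 1024, 256))
    print(out.shape)  # torch.Size([2, 1024, 256])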

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If future MLLMs produce even higher-quality or domain-specific RS captions, the same Dynamic MixExperts structure could yield further gains without changing the visual backbone.
  • The approach may extend to other dense prediction tasks such as change detection or instance segmentation in aerial imagery by reusing the caption-to-feature guidance pathway.
  • Failure modes would likely appear first in scenes where all MLLMs generate similar but incorrect descriptions, limiting the benefit of the mixing step.

Load-bearing premise

The captions produced by the chosen MLLMs through multiple prompts are consistently high-quality and relevant enough that the Dynamic MixExperts module can reliably pick and fuse the best ones for guiding segmentation.

What would settle it

Running the full MPerS pipeline on a remote sensing dataset where the MLLM captions contain systematic factual errors or hallucinations and measuring whether the reported performance gains over baselines disappear.
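A minimal harness for that experiment might look like the following. Here run_mpers is a hypothetical callable standing in for the full pipeline, and the corruption scheme (swapping captions between images) is one simple way to make every caption systematically wrong for its scene.

    import numpy as np

    def miou(pred, gt, num_classes):
        """Mean intersection-over-union for integer label maps."""
        ious = []
        for c in range(num_classes):
            inter = np.logical_and(pred == c, gt == c).sum()
            union = np.logical_or(pred == c, gt == c).sum()
            if union > 0:
                ious.append(inter / union)
        return float(np.mean(ious))

    def corrupt(captions, rng):
        """Toy corruption: shuffle captions across images so each image is
        paired with a caption describing some other scene."""
        return [captions[i] for i in rng.permutation(len(captions))]

    def stress_test(run_mpers, images, captions, gts, num_classes, seed=0):
        # run_mpers(image, caption) -> predicted label map; assumed, not
        # provided by the paper.
        rng = np.random.default_rng(seed)
        clean = [miou(run_mpers(im, cap), gt, num_classes)
                 for im, cap, gt in zip(images, captions, gts)]
        bad_caps = corrupt(captions, rng)
        broken = [miou(run_mpers(im, cap), gt, num_classes)
                  for im, cap, gt in zip(images, bad_caps, gts)]
        # If the gains vanish here, the text pathway was doing real work.
        return np.mean(clean) - np.mean(broken)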

Figures

Figures reproduced from arXiv: 2605.10769 by Hongyang Zhang, Man On Pun, Xianping Ma, Ziyao Wang, Ziyi Wang.

Figure 1: Workflow of MPerS. Simple prompts may generate inappropriate captions, leading to erroneous perceptual understanding. …
Figure 2: The pipeline for effective semantic text acquisition …
Figure 3: The framework of MPerS, which encompasses four units: vision encoder, Dynamic MLLM MixExperts extract effective textual …
Figure 4: Architecture of the Linguistic Query Guided Attention.
Figure 5: Qualitative visual comparison with state-of-the-art methods on the Vaihingen dataset. Dashed bounding boxes indicate regions …
read the original abstract

The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. However, when dealing with complex remote sensing (RS) scenes, existing studies have predominantly concentrated on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting the generation of high-quality RS captions and the investigation of their effectiveness in multimodal semantic fusion. In this context, we propose the Dynamic MLLM Mixture-of-Experts Perception-Guided Remote Sensing Scene Segmentation, referred to as MPerS. We design multiple prompts for MLLMs to generate high-quality RS captions, enabling MLLMs to perceive RS scenes from diverse expert perspectives. DINOv3 is employed to extract dense visual representations of land-covers. We design a Dynamic MixExperts module that adaptively integrates the most effective textual semantics. Linguistic Query Guided Attention is constructed to utilize textual semantic information to guide visual features for precise segmentation. The MLLMs include LLaVA, ChatGPT, and Qwen. Our method achieves superior performance on three public semantic segmentation RS datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MPerS, a multimodal method for remote sensing scene segmentation that generates RS-specific captions using multiple prompts with MLLMs (LLaVA, ChatGPT, Qwen), extracts visual features via DINOv3, adaptively fuses the most effective textual semantics through a Dynamic MixExperts module, and applies Linguistic Query Guided Attention to guide segmentation. It claims superior performance over baselines on three public RS semantic segmentation datasets.

Significance. If the superiority claim holds with proper validation, the work could meaningfully advance multimodal RS segmentation by shifting focus from pure architectural fusion to caption quality and adaptive expert selection. The introduction of Dynamic MixExperts and Linguistic Query Guided Attention offers a novel way to handle diverse textual semantics, which may generalize to other vision-language tasks in remote sensing where general-purpose MLLMs are applied to domain-specific imagery.

major comments (3)
  1. [Abstract] The assertion of 'superior performance' on three datasets is unsupported by any quantitative metrics, ablation results, statistical tests, or error analysis; this gap is load-bearing for the central empirical claim and prevents assessment of whether the gains exceed standard variance.
  2. [Method] The performance attribution for the Dynamic MixExperts module and Linguistic Query Guided Attention depends on MLLM captions reliably encoding RS-specific land-cover details rather than generic or hallucinatory content, yet no quantitative caption evaluation (human ratings, label alignment scores, or an ablation removing text guidance) is reported to substantiate this weakest assumption.
  3. [Experiments] Without ablations isolating the contribution of caption selection from the DINOv3 backbone alone, or comparisons across the three MLLMs, it is unclear whether the proposed modules drive the claimed gains or whether the results reduce to the visual backbone plus a standard segmentation head.
minor comments (2)
  1. [Abstract] The abstract is overly dense; separating the problem statement, proposed components, and results into distinct sentences would improve readability.
  2. [Method] Notation for 'Dynamic MixExperts module' and 'Linguistic Query Guided Attention' is introduced without cross-references to equations or figures defining their inputs/outputs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional evidence and clarity will strengthen the manuscript. We address each major comment below and will incorporate the suggested revisions to better support our empirical claims.

read point-by-point responses
  1. Referee: [Abstract] The assertion of 'superior performance' on three datasets is unsupported by any quantitative metrics, ablation results, statistical tests, or error analysis; this gap is load-bearing for the central empirical claim and prevents assessment of whether the gains exceed standard variance.

    Authors: We agree that the abstract should provide concrete quantitative support. In the revised manuscript, we will update the abstract to include specific mIoU (and other metric) values on the three datasets along with the observed improvements over the strongest baselines. We will also add a brief reference to the ablation studies and any statistical significance testing already performed in the experiments section. revision: yes

  2. Referee: [Method] The performance attribution for the Dynamic MixExperts module and Linguistic Query Guided Attention depends on MLLM captions reliably encoding RS-specific land-cover details rather than generic or hallucinatory content, yet no quantitative caption evaluation (human ratings, label alignment scores, or an ablation removing text guidance) is reported to substantiate this weakest assumption.

    Authors: This comment correctly identifies a missing validation step. While the method describes the multi-prompt strategy for generating RS-specific captions, we did not report direct quality metrics. We will add a dedicated evaluation subsection (or appendix) containing human ratings of caption relevance to land-cover classes on a sampled subset, plus an ablation that removes textual guidance entirely to quantify its contribution to final segmentation accuracy. revision: yes

  3. Referee: [Experiments] Without ablations isolating the contribution of caption selection from the DINOv3 backbone alone, or comparisons across the three MLLMs, it is unclear whether the proposed modules drive the claimed gains or whether the results reduce to the visual backbone plus a standard segmentation head.

    Authors: We acknowledge the need for more granular ablations. The current experiments compare against external baselines, but we will expand the ablation studies in the revised version to explicitly include: (1) DINOv3 features with a standard segmentation head only, (2) separate results for each of the three MLLMs (LLaVA, ChatGPT, Qwen) versus the Dynamic MixExperts combination, and (3) variants with and without the Linguistic Query Guided Attention module. These will be presented in additional tables to isolate the contribution of each proposed component. revision: yes
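For concreteness, the ablation grid this rebuttal promises could be enumerated as below. The configuration keys and source names are illustrative, not the authors' actual settings.

    from itertools import product

    # Hypothetical ablation grid mirroring the rebuttal's three axes.
    TEXT_SOURCES = ["none", "llava", "chatgpt", "qwen", "mixexperts"]
    GUIDANCE = ["no_lqga", "lqga"]

    def ablation_runs():
        for text, guide in product(TEXT_SOURCES, GUIDANCE):
            if text == "none" and guide == "lqga":
                continue  # no captions means nothing for LQGA to query
            # ("none", "no_lqga") is the DINOv3-backbone-only baseline the
            # referee asked for; every other cell isolates one component.
            yield {"text_source": text, "guidance": guide}

    for cfg in ablation_runs():
        print(cfg)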

Circularity Check

0 steps flagged

No circularity: empirical architecture with independent performance claims

full rationale

The paper presents an empirical method for remote sensing segmentation that integrates MLLM-generated captions (from LLaVA, ChatGPT, Qwen via multiple prompts), DINOv3 visual features, a Dynamic MixExperts module, and Linguistic Query Guided Attention. The central claim of superior performance rests on reported results across three public datasets rather than any derivation chain, equations, or first-principles reduction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The architecture is described as a novel combination of external components without tautological equivalence to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach rests on the unverified assumption that MLLM-generated captions provide reliable semantic guidance for RS scenes and that the proposed modules function as intended; no independent evidence for these is given in the abstract.

axioms (1)
  • domain assumption: Multiple expert prompts enable MLLMs to produce high-quality, diverse RS scene captions that improve multimodal fusion.
    Invoked in the design of prompts for LLaVA, ChatGPT, and Qwen.
invented entities (2)
  • Dynamic MixExperts module (no independent evidence)
    purpose: Adaptively selects and integrates the most effective textual semantics from different MLLMs.
    New module introduced to handle varying caption quality across scenes.
  • Linguistic Query Guided Attention (no independent evidence)
    purpose: Uses textual semantics to guide and refine visual features for segmentation.
    Constructed specifically for this perception-guided pipeline.

pith-pipeline@v0.9.0 · 5492 in / 1366 out tokens · 58720 ms · 2026-05-12T03:13:04.587911+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 1 internal anchor

  1. [1]

    Algorithms for semantic segmentation of multispectral re- mote sensing imagery using deep learning.ISPRS Journal of Photogrammetry and Remote Sensing, 2018. 2

  2. [2]

    In- stancecap: Improving text-to-video generation via instance- aware structured caption

    Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, and Ying Tai. In- stancecap: Improving text-to-video generation via instance- aware structured caption. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28974– 28983, 2025. 4

  3. [3]

    Liang Gao, Hui Liu, Minhang Yang, Long Chen, Yaling Wan, Zhengqing Xiao, and Yurong Qian. Stransfuse: Fus- ing swin transformer and convolutional neural network for remote sensing image semantic segmentation.IEEE journal of selected topics in applied earth observations and remote sensing, 14:10990–11003, 2021. 2

  4. [4]

    Dinomaly: The less is more philosophy in multi-class unsupervised anomaly detection

    Jia Guo, Shuai Lu, Weihang Zhang, Fang Chen, Huiqi Li, and Hongen Liao. Dinomaly: The less is more philosophy in multi-class unsupervised anomaly detection. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 20405–20415, 2025. 3

  5. [5]

    Swin transformer embedding unet for remote sensing image semantic segmentation.IEEE transactions on geoscience and remote sensing, 60:1–15, 2022

    Xin He, Yong Zhou, Jiaqi Zhao, Di Zhang, Rui Yao, and Yong Xue. Swin transformer embedding unet for remote sensing image semantic segmentation.IEEE transactions on geoscience and remote sensing, 60:1–15, 2022. 2

  6. [6]

    Ringmo-agent: A unified remote sensing foun- dation model for multi-platform and multi-modal reasoning.arXiv preprint arXiv:2507.20776, 2025

    Huiyang Hu, Peijin Wang, Yingchao Feng, Kaiwen Wei, Wenxin Yin, Wenhui Diao, Mengyu Wang, Hanbo Bi, Kaiyue Kang, Tong Ling, et al. Ringmo-agent: A unified re- mote sensing foundation model for multi-platform and multi- modal reasoning.arXiv preprint arXiv:2507.20776, 2025. 3

  7. [7]

    A2-fpn: Attention aggregation based feature pyramid network for in- stance segmentation

    Miao Hu, Yali Li, Lu Fang, and Shengjin Wang. A2-fpn: Attention aggregation based feature pyramid network for in- stance segmentation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 15343–15352, 2021. 6

  8. [8]

    Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

    Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025. 3

  9. [9]

    Wenlan: Bridging vision and language by large-scale multi-modal pre-training

    Y Huo, M Zhang, G Liu, H Lu, Y Gao, G Yang, J Wen, H Zhang, B Xu, W Zheng, et al. Wenlan: Bridging vision and language by large-scale multi-modal pre-training. arxiv (2021).arXiv preprint arXiv:2103.06561, 2021. 3

  10. [10]

    Dilateformer: Multi- scale dilated transformer for visual recognition.IEEE trans- actions on multimedia, 25:8906–8919, 2023

    Jiayu Jiao, Yu-Ming Tang, Kun-Yu Lin, Yipeng Gao, Andy J Ma, Yaowei Wang, and Wei-Shi Zheng. Dilateformer: Multi- scale dilated transformer for visual recognition.IEEE trans- actions on multimedia, 25:8906–8919, 2023. 4

  11. [11]

    Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment

    Cijo Jose, Th ´eo Moutakanni, Dahyun Kang, Federico Baldassarre, Timoth ´ee Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Micha ¨el Ramamonjisoa, Maxime Oquab, et al. Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24905–24916, 2025. 3

  12. [12]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4015–4026, 2023. 1

  13. [13]

    Hybrid-scale self-similarity ex- ploitation for remote sensing image super-resolution.IEEE Transactions on Geoscience and Remote Sensing, 60:1–10,

    Sen Lei and Zhenwei Shi. Hybrid-scale self-similarity ex- ploitation for remote sensing image super-resolution.IEEE Transactions on Geoscience and Remote Sensing, 60:1–10,

  14. [14]

    Haifeng Li, Kaijian Qiu, Li Chen, Xiaoming Mei, Liang Hong, and Chao Tao. Scattnet: Semantic segmentation net- work with spatial and channel attention mechanism for high- resolution remote sensing images.IEEE Geoscience and Re- mote Sensing Letters, 18(5):905–909, 2020. 2

  15. [15]

    Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021. 3

  16. [16]

    Multistage attention resu-net for semantic segmen- tation of fine-resolution remote sensing images.IEEE Geo- science and Remote Sensing Letters, 19:1–5, 2021

    Rui Li, Shunyi Zheng, Chenxi Duan, Jianlin Su, and Ce Zhang. Multistage attention resu-net for semantic segmen- tation of fine-resolution remote sensing images.IEEE Geo- science and Remote Sensing Letters, 19:1–5, 2021. 2, 6

  17. [17]

    Rui Li, Shunyi Zheng, Ce Zhang, Chenxi Duan, Libo Wang, and Peter M Atkinson. Abcnet: Attentive bilateral con- textual network for efficient semantic segmentation of fine- resolution remotely sensed imagery.ISPRS journal of pho- togrammetry and remote sensing, 181:84–98, 2021. 2

  18. [18]

    Dynamic updates for language adaptation in visual-language tracking

    Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, and Shuxiang Song. Dynamic updates for language adaptation in visual-language tracking. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19165–19174, 2025. 2

  19. [19]

    Sardet-100k: Towards open- source benchmark and toolkit for large-scale sar object de- tection.Advances in Neural Information Processing Systems, 37:128430–128461, 2024

    Yuxuan Li, Xiang Li, Weijie Li, Qibin Hou, Li Liu, Ming- Ming Cheng, and Jian Yang. Sardet-100k: Towards open- source benchmark and toolkit for large-scale sar object de- tection.Advances in Neural Information Processing Systems, 37:128430–128461, 2024. 2

  20. [20]

    Re- moteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

    Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Re- moteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024. 3

  21. [21]

    Esms-net: Enhancing semantic-mask segmenta- tion network with pyramid atrousformer for remote sensing image.IEEE Transactions on Geoscience and Remote Sens- ing, 2024

    Jiamin Liu, Ziyi Wang, Fulin Luo, Tan Guo, Feng Yang, and Xinbo Gao. Esms-net: Enhancing semantic-mask segmenta- tion network with pyramid atrousformer for remote sensing image.IEEE Transactions on Geoscience and Remote Sens- ing, 2024. 2

  22. [22]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015. 2

  23. [23]

    Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019. 3

  24. [24]

    Sam-assisted remote sensing imagery semantic segmentation with object and boundary constraints.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

    Xianping Ma, Qianqian Wu, Xingyu Zhao, Xiaokang Zhang, Man-On Pun, and Bo Huang. Sam-assisted remote sensing imagery semantic segmentation with object and boundary constraints.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024. 2

  25. [25]

    Rs 3 mamba: Visual state space model for remote sensing image semantic segmentation.IEEE Geoscience and Remote Sens- ing Letters, 21:1–5, 2024

    Xianping Ma, Xiaokang Zhang, and Man-On Pun. Rs 3 mamba: Visual state space model for remote sensing image semantic segmentation.IEEE Geoscience and Remote Sens- ing Letters, 21:1–5, 2024. 6

  26. [26]

    A unified framework with multimodal fine-tuning for remote sensing semantic segmentation.IEEE Transac- tions on Geoscience and Remote Sensing, 63:1–15, 2025

    Xianping Ma, Xiaokang Zhang, Man-On Pun, and Bo Huang. A unified framework with multimodal fine-tuning for remote sensing semantic segmentation.IEEE Transac- tions on Geoscience and Remote Sensing, 63:1–15, 2025. 2

  27. [27]

    Cross-entropy loss functions: Theoretical analysis and applications

    Anqi Mao, Mehryar Mohri, and Yutao Zhong. Cross-entropy loss functions: Theoretical analysis and applications. InIn- ternational conference on Machine learning, pages 23803– 23828, 2023. 5

  28. [28]

    Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022. 3

  29. [29]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 3

  30. [30]

    Denseclip: Language-guided dense prediction with context- aware prompting

    Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context- aware prompting. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 18082–18091, 2022. 3

  31. [31]

    Syndrone-multi-modal uav dataset for ur- ban scenarios

    Giulia Rizzoli, Francesco Barbato, Matteo Caligiuri, and Pietro Zanuttigh. Syndrone-multi-modal uav dataset for ur- ban scenarios. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 2210–2220,

  32. [32]

    U- net: Convolutional networks for biomedical image segmen- tation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InMedical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241, 2015. 2

  33. [33]

    DINOv3

    Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 3

  34. [34]

    Ctmfnet: Cnn and transformer multiscale fusion net- work of remote sensing urban scene imagery.IEEE Trans- actions on Geoscience and Remote Sensing, 61:1–14, 2022

    Pengfei Song, Jinjiang Li, Zhiyong An, Hui Fan, and Linwei Fan. Ctmfnet: Cnn and transformer multiscale fusion net- work of remote sensing urban scene imagery.IEEE Trans- actions on Geoscience and Remote Sensing, 61:1–14, 2022. 2

  35. [35]

    2d semantic labeling dataset.Accessed: Apr., 2018

    ISPRS Vaihingen. 2d semantic labeling dataset.Accessed: Apr., 2018. 2

  36. [36]

    Scale-aware neural network for semantic segmentation of multi-resolution remote sens- ing images.Remote sensing, 13(24):5015, 2021

    Libo Wang, Ce Zhang, Rui Li, Chenxi Duan, Xiaoliang Meng, and Peter M Atkinson. Scale-aware neural network for semantic segmentation of multi-resolution remote sens- ing images.Remote sensing, 13(24):5015, 2021. 2

  37. [37]

    A novel transformer based se- mantic segmentation scheme for fine-resolution remote sens- ing images.IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2022

    Libo Wang, Rui Li, Chenxi Duan, Ce Zhang, Xiaoliang Meng, and Shenghui Fang. A novel transformer based se- mantic segmentation scheme for fine-resolution remote sens- ing images.IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2022. 6

  38. [38]

    Libo Wang, Rui Li, Ce Zhang, Shenghui Fang, Chenxi Duan, Xiaoliang Meng, and Peter M Atkinson. Unetformer: A unet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery.ISPRS Journal of Pho- togrammetry and Remote Sensing, 190:196–214, 2022. 2, 5, 6

  39. [39]

    Libo Wang, Sijun Dong, Ying Chen, Xiaoliang Meng, Shenghui Fang, and Songlin Fei. Metasegnet: Metadata- collaborative vision-language representation learning for se- mantic segmentation of remote sensing images.IEEE Trans- actions on Geoscience and Remote Sensing, 2024. 2, 6

  40. [40]

    isaid: A large-scale dataset for instance segmentation in aerial images

    Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. isaid: A large-scale dataset for instance segmentation in aerial images. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 28–37, 2019. 2, 5

  41. [41]

    Fsvlm: A vision-language model for remote sensing farmland segmentation.IEEE Transactions on Geo- science and Remote Sensing, 2025

    Haiyang Wu, Zhuofei Du, Dandan Zhong, Yuze Wang, and Chao Tao. Fsvlm: A vision-language model for remote sensing farmland segmentation.IEEE Transactions on Geo- science and Remote Sensing, 2025. 3

  42. [42]

    Groupvit: Semantic segmentation emerges from text supervision

    Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18134–18144, 2022. 3

  43. [43]

    Bootstrapping interactive image–text alignment for remote sensing image captioning.IEEE Transactions on Geoscience and Remote Sensing, 62:1–12, 2024

    Cong Yang, Zuchao Li, and Lefei Zhang. Bootstrapping interactive image–text alignment for remote sensing image captioning.IEEE Transactions on Geoscience and Remote Sensing, 62:1–12, 2024. 3

  44. [44]

    A joint-training two-stage method for remote sensing image captioning.IEEE Transactions on Geoscience and Remote Sensing, 60:1–16, 2022

    Xiutiao Ye, Shuang Wang, Yu Gu, Jihui Wang, Ruixuan Wang, Biao Hou, Fausto Giunchiglia, and Licheng Jiao. A joint-training two-stage method for remote sensing image captioning.IEEE Transactions on Geoscience and Remote Sensing, 60:1–16, 2022. 2

  45. [45]

    arXiv preprint arXiv:2501.04001 , year=

    Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos.arXiv preprint arXiv:2501.04001, 2025. 4

  46. [46]

    Cheng Zhang, Wanshou Jiang, Yuan Zhang, Wei Wang, Qing Zhao, and Chenjie Wang. Transformer and cnn hy- brid deep neural network for semantic segmentation of very- high-resolution remote sensing imagery.IEEE Transactions on Geoscience and Remote Sensing, 60:1–20, 2022. 2

  47. [47]

    Segclip: Multimodal visual-language and prompt learning for high-resolution remote sensing se- mantic segmentation.IEEE Transactions on Geoscience and Remote Sensing, 2024

    Shijie Zhang, Bin Zhang, Yuntao Wu, Huabing Zhou, Junjun Jiang, and Jiayi Ma. Segclip: Multimodal visual-language and prompt learning for high-resolution remote sensing se- mantic segmentation.IEEE Transactions on Geoscience and Remote Sensing, 2024. 3, 6

  48. [48]

    Cooperative connection trans- former for remote sensing image captioning.IEEE Trans- actions on Geoscience and Remote Sensing, 62:1–14, 2024

    Kai Zhao and Wei Xiong. Cooperative connection trans- former for remote sensing image captioning.IEEE Trans- actions on Geoscience and Remote Sensing, 62:1–14, 2024. 3

  49. [49]

    High-resolution remote sensing image captioning based on structured atten- tion.IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2021

    Rui Zhao, Zhenwei Shi, and Zhengxia Zou. High-resolution remote sensing image captioning based on structured atten- tion.IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2021. 2

  50. [50]

    Image fusion via vision-language model.arXiv preprint arXiv:2402.02235,

    Zixiang Zhao, Lilun Deng, Haowen Bai, Yukun Cui, Zhipeng Zhang, Yulun Zhang, Haotong Qin, Dongdong Chen, Jiangshe Zhang, Peng Wang, et al. Image fusion via vision-language model.arXiv preprint arXiv:2402.02235,

  51. [51]

    Sega- gent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories

    Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qing- pei Guo, Yang Liu, Ming Yang, and Chunhua Shen. Sega- gent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3686–3696, 2025. 2

  52. [52]

    Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling

    Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 14733–14744, 2025. 2