Recognition: no theorem link
MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation
Pith reviewed 2026-05-12 03:13 UTC · model grok-4.3
The pith
Dynamic mixing of captions from multiple MLLMs guides visual features for more accurate remote sensing scene segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We design MPerS to let MLLMs perceive remote sensing scenes from diverse expert perspectives by generating high-quality captions with multiple prompts; employ DINOv3 for dense visual representations of land-covers; introduce a Dynamic MixExperts module that adaptively integrates the most effective textual semantics; and construct Linguistic Query Guided Attention to let the textual information guide visual features for precise segmentation, achieving superior performance on three public RS semantic segmentation datasets.
What carries the argument
The Dynamic MixExperts module that adaptively integrates the most effective textual semantics from MLLM captions, paired with Linguistic Query Guided Attention that uses those semantics to guide DINOv3 visual features.
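The review text does not reproduce the module's equations, but the description ("adaptively integrates the most effective textual semantics") suggests a gating network over per-expert caption embeddings. A minimal NumPy sketch under that assumption — the shapes, the `W_gate` parameter, and the softmax gating are all hypothetical, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mix_experts(caption_embs, visual_ctx, W_gate):
    """Convex mixture of expert caption embeddings, gated by visual context.

    caption_embs: (num_experts, dim)  one embedding per MLLM caption
    visual_ctx:   (dim,)              pooled visual feature from the backbone
    W_gate:       (dim, num_experts)  hypothetical learned gating weights
    """
    weights = softmax(visual_ctx @ W_gate)   # (num_experts,), sums to 1
    mixed = weights @ caption_embs           # (dim,), fused text representation
    return mixed, weights

# Toy usage: 3 experts (e.g. LLaVA / ChatGPT / Qwen captions), dim 8.
rng = np.random.default_rng(0)
mixed, weights = mix_experts(
    rng.standard_normal((3, 8)), rng.standard_normal(8), rng.standard_normal((8, 3))
)
```

In this reading, "dynamic" means the gate weights change per image because they are conditioned on the visual context, so different scenes can favor different MLLM captions.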
If this is right
- Textual semantics from multiple MLLM perspectives can be fused adaptively to improve segmentation accuracy in complex remote sensing scenes.
- Linguistic Query Guided Attention allows caption information to directly refine visual feature maps for land-cover boundaries.
- Superior results on three public RS segmentation datasets follow from the combination of diverse caption generation and dynamic expert selection.
- The method reduces reliance on purely visual models by injecting scene-level textual understanding.
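The second bullet describes a text-to-vision guidance step. One plausible reading of "Linguistic Query Guided Attention" is cross-attention with linguistic queries over dense patch features; the sketch below assumes that form (single head, no learned projections), which may differ from the paper's actual construction:

```python
import numpy as np

def linguistic_query_attention(text_q, visual_feats):
    """Text queries attend over dense visual patches (single head, no projections).

    text_q:       (num_queries, dim)  linguistic query vectors from captions
    visual_feats: (num_patches, dim)  dense patch features (DINOv3-style)
    Returns one text-conditioned visual summary per query.
    """
    scores = text_q @ visual_feats.T / np.sqrt(text_q.shape[-1])
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)  # row-wise softmax
    return attn @ visual_feats

# Toy usage: 2 linguistic queries guiding 16 visual patches of dimension 8.
rng = np.random.default_rng(1)
guided = linguistic_query_attention(
    rng.standard_normal((2, 8)), rng.standard_normal((16, 8))
)
```

Each output row is a convex combination of patch features, weighted by similarity to a text query — the mechanism by which caption semantics could sharpen land-cover boundaries.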
Where Pith is reading between the lines
- If future MLLMs produce even higher-quality or domain-specific RS captions, the same Dynamic MixExperts structure could yield further gains without changing the visual backbone.
- The approach may extend to other dense prediction tasks such as change detection or instance segmentation in aerial imagery by reusing the caption-to-feature guidance pathway.
- Failure modes would likely appear first in scenes where all MLLMs generate similar but incorrect descriptions, limiting the benefit of the mixing step.
Load-bearing premise
The captions produced by the chosen MLLMs through multiple prompts are consistently high-quality and relevant enough that the Dynamic MixExperts module can reliably pick and fuse the best ones for guiding segmentation.
What would settle it
Running the full MPerS pipeline on a remote sensing dataset where the MLLM captions contain systematic factual errors or hallucinations and measuring whether the reported performance gains over baselines disappear.
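This falsification test can be mocked up cheaply: inject systematic class confusions into the captions before they reach the fusion modules and re-measure segmentation accuracy. A toy corruption function — the class vocabulary and swap scheme are illustrative, not any dataset's actual label set:

```python
import random

# Illustrative single-word land-cover vocabulary; real label sets differ.
CLASSES = ["building", "road", "tree", "car", "grass", "water"]

def corrupt_caption(caption, error_rate=0.5, seed=None):
    """Swap each recognized land-cover word for a random *different* class
    with probability error_rate, simulating systematic MLLM hallucination."""
    rng = random.Random(seed)
    out = []
    for token in caption.split():
        if token in CLASSES and rng.random() < error_rate:
            token = rng.choice([c for c in CLASSES if c != token])
        out.append(token)
    return " ".join(out)
```

Sweeping `error_rate` from 0 to 1 would trace how quickly the reported gains over visual-only baselines decay as caption quality degrades.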
original abstract
The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. However, when dealing with complex remote sensing (RS) scenes, existing studies have predominantly concentrated on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting the generation of high-quality RS captions and the investigation of their effectiveness in multimodal semantic fusion. In this context, we propose the Dynamic MLLM Mixture-of-Experts Perception-Guided Remote Sensing Scene Segmentation, referred to as MPerS. We design multiple prompts for MLLMs to generate high-quality RS captions, enabling MLLMs to perceive RS scenes from diverse expert perspectives. DINOv3 is employed to extract dense visual representations of land-covers. We design a Dynamic MixExperts module that adaptively integrates the most effective textual semantics. Linguistic Query Guided Attention is constructed to utilize textual semantic information to guide visual features for precise segmentation. The MLLMs include LLaVA, ChatGPT, and Qwen. Our method achieves superior performance on three public semantic segmentation RS datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MPerS, a multimodal method for remote sensing scene segmentation that generates RS-specific captions using multiple prompts with MLLMs (LLaVA, ChatGPT, Qwen), extracts visual features via DINOv3, adaptively fuses the most effective textual semantics through a Dynamic MixExperts module, and applies Linguistic Query Guided Attention to guide segmentation. It claims superior performance over baselines on three public RS semantic segmentation datasets.
Significance. If the superiority claim holds with proper validation, the work could meaningfully advance multimodal RS segmentation by shifting focus from pure architectural fusion to caption quality and adaptive expert selection. The introduction of Dynamic MixExperts and Linguistic Query Guided Attention offers a novel way to handle diverse textual semantics, which may generalize to other vision-language tasks in remote sensing where general-purpose MLLMs are applied to domain-specific imagery.
major comments (3)
- [Abstract] Abstract: the assertion of 'superior performance' on three datasets is unsupported by any quantitative metrics, ablation results, statistical tests, or error analysis, which is load-bearing for the central empirical claim and prevents assessment of whether gains exceed standard variance.
- [Method] Method section (Dynamic MixExperts and Linguistic Query Guided Attention): the performance attribution depends on MLLM captions reliably encoding RS-specific land-cover details rather than generic or hallucinatory content, yet no quantitative caption evaluation (human ratings, label alignment scores, or ablation removing text guidance) is reported to substantiate this weakest assumption.
- [Experiments] Experiments: without ablations isolating the contribution of caption selection versus the DINOv3 backbone alone, or comparisons across the three MLLMs, it is unclear whether the proposed modules drive the claimed gains or if results reduce to the visual backbone plus a standard segmentation head.
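The "label alignment scores" suggested in the second major comment could be as simple as class recall over caption tokens. A deliberately minimal version — a real evaluation would need stemming, synonyms, and multi-word class names:

```python
def label_alignment(caption, gt_classes):
    """Recall-style alignment: fraction of ground-truth classes named in the caption."""
    words = {w.strip(".,").lower() for w in caption.split()}
    return sum(c in words for c in gt_classes) / len(gt_classes)

# Toy usage on a hypothetical caption and class list.
score = label_alignment(
    "aerial view: a building, a road and a tree", ["building", "road", "tree"]
)
```

Averaged over a labeled subset, such a score would quantify whether the MLLM captions actually encode RS-specific land-cover content rather than generic description.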
minor comments (2)
- [Abstract] The abstract is overly dense; separating the problem statement, proposed components, and results into distinct sentences would improve readability.
- [Method] Notation for 'Dynamic MixExperts module' and 'Linguistic Query Guided Attention' is introduced without cross-references to equations or figures defining their inputs/outputs.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key areas where additional evidence and clarity will strengthen the manuscript. We address each major comment below and will incorporate the suggested revisions to better support our empirical claims.
point-by-point responses
-
Referee: [Abstract] Abstract: the assertion of 'superior performance' on three datasets is unsupported by any quantitative metrics, ablation results, statistical tests, or error analysis, which is load-bearing for the central empirical claim and prevents assessment of whether gains exceed standard variance.
Authors: We agree that the abstract should provide concrete quantitative support. In the revised manuscript, we will update the abstract to include specific mIoU (and other metric) values on the three datasets along with the observed improvements over the strongest baselines. We will also add a brief reference to the ablation studies and any statistical significance testing already performed in the experiments section. revision: yes
-
Referee: [Method] Method section (Dynamic MixExperts and Linguistic Query Guided Attention): the performance attribution depends on MLLM captions reliably encoding RS-specific land-cover details rather than generic or hallucinatory content, yet no quantitative caption evaluation (human ratings, label alignment scores, or ablation removing text guidance) is reported to substantiate this weakest assumption.
Authors: This comment correctly identifies a missing validation step. While the method describes the multi-prompt strategy for generating RS-specific captions, we did not report direct quality metrics. We will add a dedicated evaluation subsection (or appendix) containing human ratings of caption relevance to land-cover classes on a sampled subset, plus an ablation that removes textual guidance entirely to quantify its contribution to final segmentation accuracy. revision: yes
-
Referee: [Experiments] Experiments: without ablations isolating the contribution of caption selection versus the DINOv3 backbone alone, or comparisons across the three MLLMs, it is unclear whether the proposed modules drive the claimed gains or if results reduce to the visual backbone plus a standard segmentation head.
Authors: We acknowledge the need for more granular ablations. The current experiments compare against external baselines, but we will expand the ablation studies in the revised version to explicitly include: (1) DINOv3 features with a standard segmentation head only, (2) separate results for each of the three MLLMs (LLaVA, ChatGPT, Qwen) versus the Dynamic MixExperts combination, and (3) variants with and without the Linguistic Query Guided Attention module. These will be presented in additional tables to isolate the contribution of each proposed component. revision: yes
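The ablation grid promised in this last response can be enumerated mechanically. A sketch of the variant list — component names follow the rebuttal, and the skip rule reflects that Linguistic Query Guided Attention needs some text source to attend with:

```python
from itertools import product

# Caption sources: visual-only baseline (None), each single MLLM, full mixture.
CAPTION_SOURCES = [None, "llava", "chatgpt", "qwen", "mix_experts"]

def ablation_variants():
    """Enumerate ablation configurations: caption source x attention module."""
    for src, lqga in product(CAPTION_SOURCES, [False, True]):
        if src is None and lqga:
            continue  # the attention module has no text to attend with
        yield {"captions": src, "lqga": lqga}

variants = list(ablation_variants())  # 9 configurations
```

One table row per configuration would directly separate the backbone's contribution from caption selection and from the attention module.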
Circularity Check
No circularity: empirical architecture with independent performance claims
full rationale
The paper presents an empirical method for remote sensing segmentation that integrates MLLM-generated captions (from LLaVA, ChatGPT, Qwen via multiple prompts), DINOv3 visual features, a Dynamic MixExperts module, and Linguistic Query Guided Attention. The central claim of superior performance rests on reported results across three public datasets rather than any derivation chain, equations, or first-principles reduction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The architecture is described as a novel combination of external components without tautological equivalence to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multiple expert prompts enable MLLMs to produce high-quality, diverse RS scene captions that improve multimodal fusion.
invented entities (2)
- Dynamic MixExperts module (no independent evidence)
- Linguistic Query Guided Attention (no independent evidence)
Reference graph
Works this paper leans on
- [1] Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning. ISPRS Journal of Photogrammetry and Remote Sensing, 2018.
- [2] Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, and Ying Tai. Instancecap: Improving text-to-video generation via instance-aware structured caption. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28974–28983, 2025.
- [3] Liang Gao, Hui Liu, Minhang Yang, Long Chen, Yaling Wan, Zhengqing Xiao, and Yurong Qian. Stransfuse: Fusing swin transformer and convolutional neural network for remote sensing image semantic segmentation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:10990–11003, 2021.
- [4] Jia Guo, Shuai Lu, Weihang Zhang, Fang Chen, Huiqi Li, and Hongen Liao. Dinomaly: The less is more philosophy in multi-class unsupervised anomaly detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 20405–20415, 2025.
- [5] Xin He, Yong Zhou, Jiaqi Zhao, Di Zhang, Rui Yao, and Yong Xue. Swin transformer embedding unet for remote sensing image semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing, 60:1–15, 2022.
- [6] Huiyang Hu, Peijin Wang, Yingchao Feng, Kaiwen Wei, Wenxin Yin, Wenhui Diao, Mengyu Wang, Hanbo Bi, Kaiyue Kang, Tong Ling, et al. Ringmo-agent: A unified remote sensing foundation model for multi-platform and multi-modal reasoning. arXiv preprint arXiv:2507.20776, 2025.
- [7] Miao Hu, Yali Li, Lu Fang, and Shengjin Wang. A2-fpn: Attention aggregation based feature pyramid network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15343–15352, 2021.
- [8] Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025.
- [9] Y Huo, M Zhang, G Liu, H Lu, Y Gao, G Yang, J Wen, H Zhang, B Xu, W Zheng, et al. Wenlan: Bridging vision and language by large-scale multi-modal pre-training. arXiv preprint arXiv:2103.06561, 2021.
- [10] Jiayu Jiao, Yu-Ming Tang, Kun-Yu Lin, Yipeng Gao, Andy J Ma, Yaowei Wang, and Wei-Shi Zheng. Dilateformer: Multi-scale dilated transformer for visual recognition. IEEE Transactions on Multimedia, 25:8906–8919, 2023.
- [11] Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, et al. Dinov2 meets text: A unified framework for image- and pixel-level vision-language alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24905–24916, 2025.
- [12] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4015–4026, 2023.
- [13] Sen Lei and Zhenwei Shi. Hybrid-scale self-similarity exploitation for remote sensing image super-resolution. IEEE Transactions on Geoscience and Remote Sensing, 60:1–10.
- [14] Haifeng Li, Kaijian Qiu, Li Chen, Xiaoming Mei, Liang Hong, and Chao Tao. Scattnet: Semantic segmentation network with spatial and channel attention mechanism for high-resolution remote sensing images. IEEE Geoscience and Remote Sensing Letters, 18(5):905–909, 2020.
- [15] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705, 2021.
- [16] Rui Li, Shunyi Zheng, Chenxi Duan, Jianlin Su, and Ce Zhang. Multistage attention resu-net for semantic segmentation of fine-resolution remote sensing images. IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2021.
- [17] Rui Li, Shunyi Zheng, Ce Zhang, Chenxi Duan, Libo Wang, and Peter M Atkinson. Abcnet: Attentive bilateral contextual network for efficient semantic segmentation of fine-resolution remotely sensed imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 181:84–98, 2021.
- [18] Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, and Shuxiang Song. Dynamic updates for language adaptation in visual-language tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19165–19174, 2025.
- [19] Yuxuan Li, Xiang Li, Weijie Li, Qibin Hou, Li Liu, Ming-Ming Cheng, and Jian Yang. Sardet-100k: Towards open-source benchmark and toolkit for large-scale sar object detection. Advances in Neural Information Processing Systems, 37:128430–128461, 2024.
- [20] Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Remoteclip: A vision language foundation model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024.
- [21] Jiamin Liu, Ziyi Wang, Fulin Luo, Tan Guo, Feng Yang, and Xinbo Gao. Esms-net: Enhancing semantic-mask segmentation network with pyramid atrousformer for remote sensing image. IEEE Transactions on Geoscience and Remote Sensing, 2024.
- [22] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
- [23] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32, 2019.
- [24] Xianping Ma, Qianqian Wu, Xingyu Zhao, Xiaokang Zhang, Man-On Pun, and Bo Huang. Sam-assisted remote sensing imagery semantic segmentation with object and boundary constraints. IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024.
- [25] Xianping Ma, Xiaokang Zhang, and Man-On Pun. RS3Mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geoscience and Remote Sensing Letters, 21:1–5, 2024.
- [26] Xianping Ma, Xiaokang Zhang, Man-On Pun, and Bo Huang. A unified framework with multimodal fine-tuning for remote sensing semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing, 63:1–15, 2025.
- [27] Anqi Mao, Mehryar Mohri, and Yutao Zhong. Cross-entropy loss functions: Theoretical analysis and applications. In International Conference on Machine Learning, pages 23803–23828, 2023.
- [28] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [30] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18082–18091, 2022.
- [31] Giulia Rizzoli, Francesco Barbato, Matteo Caligiuri, and Pietro Zanuttigh. Syndrone: multi-modal uav dataset for urban scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2210–2220.
- [32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241, 2015.
- [33] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.
- [34] Pengfei Song, Jinjiang Li, Zhiyong An, Hui Fan, and Linwei Fan. Ctmfnet: Cnn and transformer multiscale fusion network of remote sensing urban scene imagery. IEEE Transactions on Geoscience and Remote Sensing, 61:1–14, 2022.
- [35] ISPRS Vaihingen. 2d semantic labeling dataset. Accessed: Apr., 2018.
- [36] Libo Wang, Ce Zhang, Rui Li, Chenxi Duan, Xiaoliang Meng, and Peter M Atkinson. Scale-aware neural network for semantic segmentation of multi-resolution remote sensing images. Remote Sensing, 13(24):5015, 2021.
- [37] Libo Wang, Rui Li, Chenxi Duan, Ce Zhang, Xiaoliang Meng, and Shenghui Fang. A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images. IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2022.
- [38] Libo Wang, Rui Li, Ce Zhang, Shenghui Fang, Chenxi Duan, Xiaoliang Meng, and Peter M Atkinson. Unetformer: A unet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 190:196–214, 2022.
- [39] Libo Wang, Sijun Dong, Ying Chen, Xiaoliang Meng, Shenghui Fang, and Songlin Fei. Metasegnet: Metadata-collaborative vision-language representation learning for semantic segmentation of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 2024.
- [40] Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. isaid: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 28–37, 2019.
- [41] Haiyang Wu, Zhuofei Du, Dandan Zhong, Yuze Wang, and Chao Tao. Fsvlm: A vision-language model for remote sensing farmland segmentation. IEEE Transactions on Geoscience and Remote Sensing, 2025.
- [42] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18134–18144, 2022.
- [43] Cong Yang, Zuchao Li, and Lefei Zhang. Bootstrapping interactive image–text alignment for remote sensing image captioning. IEEE Transactions on Geoscience and Remote Sensing, 62:1–12, 2024.
- [44] Xiutiao Ye, Shuang Wang, Yu Gu, Jihui Wang, Ruixuan Wang, Biao Hou, Fausto Giunchiglia, and Licheng Jiao. A joint-training two-stage method for remote sensing image captioning. IEEE Transactions on Geoscience and Remote Sensing, 60:1–16, 2022.
- [45] Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001, 2025.
- [46] Cheng Zhang, Wanshou Jiang, Yuan Zhang, Wei Wang, Qing Zhao, and Chenjie Wang. Transformer and cnn hybrid deep neural network for semantic segmentation of very-high-resolution remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 60:1–20, 2022.
- [47] Shijie Zhang, Bin Zhang, Yuntao Wu, Huabing Zhou, Junjun Jiang, and Jiayi Ma. Segclip: Multimodal visual-language and prompt learning for high-resolution remote sensing semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing, 2024.
- [48] Kai Zhao and Wei Xiong. Cooperative connection transformer for remote sensing image captioning. IEEE Transactions on Geoscience and Remote Sensing, 62:1–14, 2024.
- [49] Rui Zhao, Zhenwei Shi, and Zhengxia Zou. High-resolution remote sensing image captioning based on structured attention. IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2021.
- [50] Zixiang Zhao, Lilun Deng, Haowen Bai, Yukun Cui, Zhipeng Zhang, Yulun Zhang, Haotong Qin, Dongdong Chen, Jiangshe Zhang, Peng Wang, et al. Image fusion via vision-language model. arXiv preprint arXiv:2402.02235.
- [51] Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, Yang Liu, Ming Yang, and Chunhua Shen. Segagent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3686–3696, 2025.
- [52] Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14733–14744, 2025.