Recognition: 2 Lean theorem links
DINO Soars: DINOv3 for Open-Vocabulary Semantic Segmentation of Remote Sensing Imagery
Pith reviewed 2026-05-08 18:18 UTC · model grok-4.3
The pith
A model built on the DINOv3 foundation model performs open-vocabulary semantic segmentation on remote sensing imagery at state-of-the-art levels without any remote sensing pre-training or backbone fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce CAFe-DINO, which leverages the DINOv3 backbone for open-vocabulary semantic segmentation of remote sensing imagery. The model aggregates text-image similarity scores into cost maps and applies training-free feature upsampling to produce dense segmentation outputs. The system is fine-tuned on an RS-targeted subset of COCO-Stuff rather than on remote sensing data, and the authors report state-of-the-art results on standard RS benchmarks, surpassing other OVSS methods that do incorporate remote sensing fine-tuning.
What carries the argument
CAFe-DINO, which applies cost aggregation to DINOv3 text-image similarity scores and uses training-free feature upsampling to generate dense segmentation predictions from robust general features.
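To make the mechanism concrete, here is a minimal sketch of the cost-map idea: cosine similarities between frozen-backbone patch features and per-class text embeddings form a raw cost volume, which is then upsampled without any learned upsampler. This is an illustrative reconstruction under assumptions, not the authors' implementation; the tensor shapes, function names, and the bilinear stand-in for the paper's training-free upsampling are all placeholders.

```python
import torch
import torch.nn.functional as F

def build_cost_maps(image_feats: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Frame text-image similarities as cost maps (CAT-Seg-style sketch).

    image_feats: (B, D, H, W) patch features from a frozen backbone (e.g. DINOv3).
    text_embs:   (C, D) one embedding per class prompt from the aligned text encoder.
    Returns:     (B, C, H, W) raw cost maps, one channel per class.
    """
    image_feats = F.normalize(image_feats, dim=1)  # unit-norm each patch feature
    text_embs = F.normalize(text_embs, dim=1)      # unit-norm each class embedding
    # Cosine similarity between every patch feature and every class embedding.
    return torch.einsum("bdhw,cd->bchw", image_feats, text_embs)

def upsample_and_predict(cost: torch.Tensor, out_hw: tuple) -> torch.Tensor:
    """Training-free upsampling of coarse cost maps, then per-pixel argmax.

    Bilinear resizing stands in for the paper's upsampler here.
    """
    cost = F.interpolate(cost, size=out_hw, mode="bilinear", align_corners=False)
    return cost.argmax(dim=1)  # (B, H, W) predicted class indices

# Example with dummy tensors: 2 images, 1024-dim features, 16x16 patch grid, 5 classes.
feats = torch.randn(2, 1024, 16, 16)
texts = torch.randn(5, 1024)
pred = upsample_and_predict(build_cost_maps(feats, texts), (224, 224))  # (2, 224, 224)
```

In CAFe-DINO the raw cost maps are additionally refined by a learned cost-aggregation module, the only trained component; the sketch omits that step.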
Load-bearing premise
DINOv3 provides sufficiently robust latent representations for remote sensing imagery that no RS-domain pre-training or backbone fine-tuning is required.
What would settle it
An experiment in which fine-tuning the DINOv3 backbone on remote sensing data yielded significantly better segmentation performance than frozen-backbone CAFe-DINO on the same datasets would show that the claim does not hold.
Original abstract
The remote sensing (RS) domain suffers from a lack of densely labeled datasets, which are costly to obtain. Thus, models that can segment RS imagery well without supervised fine-tuning are valuable, but existing solutions fall behind supervised methods. Recently, DINOv3 surpassed SOTA RS foundation models on the GEO-bench segmentation benchmark without pre-training on RS data. Additionally, DINO.txt has enabled open vocabulary semantic segmentation (OVSS) with the DINOv3 backbone. We leverage these developments to form an OVSS model for RS imagery, free of RS-domain fine-tuning. Our model, CAFe-DINO (Cost Aggregation + Feature Upsampling with DINO), exploits the strong OVSS performance of DINOv3 for RS imagery via cost aggregation and training-free upsampling of text-image similarity scores. The robust latent of the DINOv3 backbone eliminates the need for fine-tuning on RS imagery; we instead fine-tune our model on an RS-targeted subset of COCO-Stuff. CAFe-DINO achieves state-of-the-art performance on key RS segmentation datasets, outperforming OVSS methods fine-tuned on RS data. Our code and data are publicly available at https://github.com/rfaulk/DINO_Soars.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CAFe-DINO, an open-vocabulary semantic segmentation (OVSS) model for remote sensing (RS) imagery that uses the DINOv3 backbone without any RS-domain pre-training or backbone fine-tuning. It combines cost aggregation with training-free upsampling of text-image similarity scores, fine-tunes only the CAFe components on an RS-targeted COCO-Stuff subset, and reports state-of-the-art performance on key RS segmentation benchmarks, outperforming existing OVSS methods that were fine-tuned on RS data. Code and data are released publicly.
Significance. If the results hold, the work demonstrates that natural-image pre-trained foundation models can transfer effectively to RS OVSS with minimal adaptation, addressing the scarcity of densely labeled RS data. The public code and data constitute a clear strength for reproducibility. The result would be notable if it can be shown that gains derive from the DINOv3 latents rather than solely from the aggregation/upsampling modules.
Major comments (1)
- [Abstract, §3 (Methods)] The central claim that 'the robust latent of the DINOv3 backbone eliminates the need for fine-tuning on RS imagery' and enables SOTA results is load-bearing but rests on an untested assumption about domain robustness. RS imagery exhibits top-down geometry, multi-spectral statistics, and extreme scale variation outside DINOv3's natural-image pre-training distribution. An ablation comparing the reported no-backbone-fine-tuning setup against a version with RS backbone fine-tuning (or against RS-pretrained backbones) is required to confirm that outperformance is not attributable only to the cost-aggregation and upsampling modules.
Minor comments (2)
- [Abstract] The claim of 'state-of-the-art performance on key RS segmentation datasets' is stated without naming the datasets or reporting any quantitative metrics or baselines; this should be expanded for immediate clarity.
- [§4 (Experiments)] The paper should explicitly list the exact RS benchmarks used, the competing OVSS methods (including whether their backbones were RS-fine-tuned), and error bars or statistical significance tests to support the SOTA claim.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and recognition of the work's potential significance for RS OVSS with minimal adaptation. We address the major comment below.
Point-by-point responses
- Referee: [Abstract, §3 (Methods)] The central claim that 'the robust latent of the DINOv3 backbone eliminates the need for fine-tuning on RS imagery' and enables SOTA results is load-bearing but rests on an untested assumption about domain robustness. RS imagery exhibits top-down geometry, multi-spectral statistics, and extreme scale variation outside DINOv3's natural-image pre-training distribution. An ablation comparing the reported no-backbone-fine-tuning setup against a version with RS backbone fine-tuning (or against RS-pretrained backbones) is required to confirm that outperformance is not attributable only to the cost-aggregation and upsampling modules.
Authors: We agree that an explicit ablation would strengthen the central claim. Our work is motivated by DINOv3's prior results on the GEO-bench RS segmentation benchmark, where it outperformed RS-pretrained models without domain fine-tuning. However, we acknowledge this does not directly isolate contributions in the OVSS setting with CAFe. In the revised manuscript we will add an ablation that fine-tunes the DINOv3 backbone on RS data and compares performance against the frozen-backbone version. This will test whether gains derive from the pre-trained latents rather than from the aggregation/upsampling modules alone. We will also note that experiments use RGB RS imagery to match DINOv3's input distribution.
revision: yes
Circularity Check
No circularity; empirical application of external DINOv3 backbone with independent benchmarks
Full rationale
The paper's chain consists of citing DINOv3's prior GEO-bench results (external to this work), constructing CAFe-DINO via cost aggregation and upsampling on top of the frozen backbone, fine-tuning only the added components on an RS-targeted COCO-Stuff subset, and reporting experimental SOTA numbers on RS segmentation datasets. No equation or claim reduces by construction to its own inputs; the robustness premise is supported by the cited external benchmark rather than self-citation or redefinition, and public code enables independent verification. This is a standard empirical transfer-learning application with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: DINOv3 latent representations are robust and transferable to remote sensing imagery without domain-specific fine-tuning of the backbone.
Lean theorems connected to this paper
- IndisputableMonolith.Cost (Jcost): Jcost_unit0 / Jcost_pos_of_ne_one (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "CAT-Seg [8] introduced cost aggregation for OVSS by framing similarity scores between CLIP text and image embeddings as cost maps. ... The cost aggregation refines the raw cost maps into per-class probability maps."
- Foundation.AlphaCoordinateFixation / BranchSelection: n/a, no parameter-free derivation present (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We use the ViT-L variant of the DINOv3.txt model ... We resize images to 224×224 ... batch size of 4 for 45,000 iterations."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] 2D Semantic Labeling. https://www.isprs.org/.
- [2] Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3689–3698, 2024.
- [3] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
- [4] Qinglong Cao, Yuntian Chen, Chao Ma, and Xiaokang Yang. Open-vocabulary high-resolution remote sensing image semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing, 63:1–14, 2025.
- [5] Jia-Ren Chang and Yong-Sheng Chen. Pyramid Stereo Matching Network. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5410–5418, 2018.
- [6] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
- [7] Seokju Cho, Sunghwan Hong, and Seungryong Kim. CATs++: Boosting cost aggregation with convolutions and transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7174–7194, 2023.
- [8] Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4113–4123, Seattle, WA, USA, 2024. IEEE.
- [9] Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, and Nicolas Thome. JAFAR: Jack up any feature at any resolution, 2025.
- [10] Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling Zero-Shot Semantic Segmentation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11573–11582, 2022.
- [11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2020.
- [12] Saikat Dutta, Akhil Vasim, Siddhant Gole, Hamid Rezatofighi, and Biplab Banerjee. AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2245–2255, Nashville, TN, USA, 2025. IEEE.
- [13] Stephanie Fu, Mark Hamilton, Laura E. Brandt, Axel Feldmann, Zhoutong Zhang, and William T. Freeman. FeatUp: A model-agnostic framework for features at any resolution. In The Twelfth International Conference on Learning Representations, 2024.
- [14] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling Open-Vocabulary Image Segmentation with Image-Level Labels. In Computer Vision – ECCV 2022, pages 540–557. Springer Nature Switzerland, Cham, 2022.
- [15] Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang, and Hongsheng Li. Group-Wise Correlation Stereo Network. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3268–3277, Long Beach, CA, USA, 2019. IEEE.
- [16] Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, Oriane Siméoni, Huy V. Vo, Patrick Labatut, and Piotr Bojanowski. DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment, 2024.
- [17] Dahyun Kang and Minsu Cho. In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLI, pages 143–164, Berlin, Heidelberg, 2024.
- [18] Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan David Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, Mehmet Gunturkun, Gabriel Huang, David Vazquez, Dava Newman, Yoshua Bengio, Stefano Ermon, and Xiao Xiang Zhu. GEO-bench: Toward foundation models for earth monitoring. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023.
- [19] Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. ProxyCLIP: Proxy attention improves CLIP for open-vocabulary segmentation. In European Conference on Computer Vision, pages 70–88. Springer, 2024.
- [20] Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic segmentation. In International Conference on Learning Representations, 2022.
- [21] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
- [22] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, pages 19730–19742, Honolulu, Hawaii, USA, 2023.
- [23] Kaiyu Li, Ruixun Liu, Xiangyong Cao, Xueru Bai, Feng Zhou, Deyu Meng, and Zhi Wang. SegEarth-OV: Towards training-free open-vocabulary segmentation for remote sensing images. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10545–10556, 2025.
- [24] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7070, 2023.
- [25] Yong Liu, Sule Bai, Guanbin Li, Yitong Wang, and Yansong Tang. Open-Vocabulary Segmentation with Semantic-Assisted Calibration. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3491–3500, 2024.
- [26] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002, Montreal, QC, Canada, 2021.
- [27] Jishnu Mukhoti, Tsung-Yu Lin, Omid Poursaeed, Rui Wang, Ashish Shah, Philip H.S. Torr, and Ser-Nam Lim. Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19413–19423, 2023.
- [28] Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and I. Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning, 2021.
- [29] Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation. In Computer Vision – ECCV 2024, pages 139–156, Cham, 2025. Springer Nature Switzerland.
- [30] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, et al. DINOv3, 2025.
- [31] Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, and Siyang Li. CLIP as RNN: Segment countless visual concepts without training endeavor. In CVPR, 2024.
- [32] Saksham Suri, Matthew Walmer, Kamal Gupta, and Abhinav Shrivastava. LiFT: A surprisingly simple lightweight feature transform for dense ViT descriptors. In European Conference on Computer Vision, pages 110–128. Springer, 2025.
- [33] Daniela Szwarcman, Sujit Roy, Paolo Fraccaro, Þorsteinn Elí Gíslason, Benedikt Blumenstiel, Rinki Ghosal, Pedro Henrique de Oliveira, João Lucas de Sousa Almeida, Rocco Sedona, Yanghui Kang, Srija Chakraborty, Sizhe Wang, Ankur Kumar, Myscon Truong, Denys Godwin, Hyunho Lee, Chia-Yu Hsu, Ata Akbari Asanjan, Besart Mujeci, Trevor Keenan, Paulo Arévolo, et al. arXiv preprint arXiv:2412.02732, 2024.
- [34] Jamie Tolan, Hung-I Yang, Benjamin Nosarzewski, Guillaume Couairon, Huy V. Vo, John Brandt, Justine Spore, Sayantan Majumdar, Daniel Haziza, Janaki Vamaraju, Theo Moutakanni, Piotr Bojanowski, Tracy Johns, Brian White, Tobias Tiecke, and Camille Couprie. Very high resolution canopy height maps from RGB imagery using self-supervised vision transformers, 2024.
- [35] Tobias Uelwer, Jan Robine, Stefan Sylvius Wagner, Marc Höftmann, Eric Upschulte, Sebastian Konietzny, Maike Behrendt, and Stefan Harmeling. A survey on self-supervised methods for visual representation learning. Machine Learning, 114(4):111, 2025.
- [36] Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation.
- [37] Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, and Dong Xu. Diffusion model is secretly a training-free open vocabulary semantic segmenter. arXiv preprint arXiv:2309.02773, 2023.
- [38] Wenzhen Wang, Aoran Xiao, Wei He, Hongyuan Zhu, and Liang Xiao. Text-to-image activation for open-vocabulary semantic segmentation in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 63:1–17, 2025.
- [39] Thomas Wimmer, Prune Truong, Marie-Julie Rakotosaona, Michael Oechsle, Federico Tombari, Bernt Schiele, and Jan Eric Lenssen. AnyUp: Universal feature upsampling. arXiv preprint arXiv:2510.12764, 2025.
- [40] Junshi Xia, Naoto Yokoya, Bruno Adriano, and Clifford Broni-Bediako. OpenEarthMap: A Benchmark Dataset for Global High-Resolution Land Cover Mapping. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6243–6253, 2023.
- [41] Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J. Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired foundation model for observing the Earth crossing modalities. arXiv preprint arXiv:2403.15356, 2024.
- [42] Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J. Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation, 2025.
- [43] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. SAN: Side adapter network for open-vocabulary semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15546–15561, 2023.
- [44] Chengyang Ye, Yunzhi Zhuge, and Pingping Zhang. Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9):9436–9444, 2025.
- [45] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-Shot Transfer with Locked-image text Tuning. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18102–18112, 2022.
- [46] Shijie Zhang, Bin Zhang, Yuntao Wu, Huabing Zhou, Junjun Jiang, and Jiayi Ma. SegCLIP: Multimodal Visual-Language and Prompt Learning for High-Resolution Remote Sensing Semantic Segmentation. IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024.
- [47] Chaoyang Zhu and Long Chen. A survey on open-vocabulary detection and segmentation: Past, present, and future. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):8954–8975, 2024.
Supplementary material excerpts
- Prompt details: every evaluated dataset has a named set of classes. Rather than using these names directly, the authors make logical substitutions (listed in their Tab. 7 for reproducibility), targeting classes with complex or abstract semantic names (e.g. "agriculture").
- COCO-Stuff subset: the fine-tuning subset contains only the following classes: bicycle, car, motorcycle, airplane, bus, train, truck, boat, bridge, building, bush, dirt, fence, grass, gravel, ground, hill, house, leaves, metal, mountain, mud, pavement, plant, platform, playing field, railing, railroad, river, road, rock, roof, ...
- Additional cost maps: the supplement provides cost maps for samples from Potsdam (Fig. 6), Vaihingen (Fig. 7), LoveDA (Fig. 8), and OEM (Fig. 9), showing per-class cost maps (road, building, low vegetation, tree, car) alongside CAFe-DINO and DINOv3 predictions and the true mask.