pith. machine review for the scientific record.

arxiv: 2605.03175 · v1 · submitted 2026-05-04 · 💻 cs.CV

Recognition: 2 theorem links


DINO Soars: DINOv3 for Open-Vocabulary Semantic Segmentation of Remote Sensing Imagery

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary semantic segmentation · remote sensing · DINOv3 · cost aggregation · feature upsampling · foundation models · COCO-Stuff

The pith

A model built on the DINOv3 foundation model performs open-vocabulary semantic segmentation on remote sensing imagery at state-of-the-art levels without any remote sensing pre-training or backbone fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that DINOv3, a general vision foundation model, can be adapted for open-vocabulary semantic segmentation in remote sensing without any training of the core model on remote sensing data. By aggregating similarity scores between image patches and text descriptions and then applying simple upsampling, the CAFe-DINO model produces accurate segmentation masks. The authors fine-tune only on a subset of the COCO-Stuff dataset targeting remote sensing-like scenes, and this suffices to beat models that were specifically fine-tuned on remote sensing imagery. The approach addresses the scarcity of labeled remote sensing data by leveraging strong features from non-domain-specific training.

Core claim

The authors introduce CAFe-DINO, which leverages the DINOv3 backbone to achieve open-vocabulary semantic segmentation in remote sensing imagery. Through cost aggregation of text-image similarities and training-free upsampling, the model generates segmentation outputs. They fine-tune the system on an RS-targeted subset of COCO-Stuff rather than on remote sensing data, and report state-of-the-art results on standard RS benchmarks, surpassing other OVSS methods that do incorporate remote sensing fine-tuning.

What carries the argument

CAFe-DINO, which applies cost aggregation to DINOv3 text-image similarity scores and uses training-free feature upsampling to generate dense segmentation predictions from robust general features.
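That pipeline — a cost volume of patch-text similarities, spatial aggregation, then training-free upsampling — can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's actual CAFe modules: the patch-grid size, feature dimension, box-filter aggregation, and nearest-neighbour upsampling are all illustrative stand-ins.

```python
import numpy as np

def cost_volume(patch_feats, text_embeds):
    """Cosine similarity between every image patch and every class prompt.

    patch_feats: (H, W, D) backbone patch features (e.g. a frozen ViT)
    text_embeds: (K, D) one embedding per class name
    returns:     (H, W, K) raw cost volume
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return p @ t.T

def aggregate(cost, iters=2):
    """Toy spatial aggregation: average each cost map with its 4-neighbours.
    (CAFe-DINO learns this step; a box filter merely shows the effect.)"""
    for _ in range(iters):
        padded = np.pad(cost, ((1, 1), (1, 1), (0, 0)), mode="edge")
        cost = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                padded[1:-1, :-2] + padded[1:-1, 2:] + cost) / 5.0
    return cost

def upsample(cost, scale):
    """Training-free upsampling of the coarse cost maps to pixel resolution."""
    return np.repeat(np.repeat(cost, scale, axis=0), scale, axis=1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 16, 384))   # 16x16 patch grid, 384-d features
texts = rng.normal(size=(5, 384))        # 5 class prompts
seg = upsample(aggregate(cost_volume(feats, texts)), scale=14).argmax(-1)
print(seg.shape)  # per-pixel class indices at 224x224
```

The argmax over the upsampled, aggregated cost volume yields the dense open-vocabulary prediction; everything learnable in the paper lives in the aggregation step, with the backbone frozen.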

Load-bearing premise

DINOv3 provides sufficiently robust latent representations for remote sensing imagery that no RS-domain pre-training or backbone fine-tuning is required.

What would settle it

A test in which fine-tuning the DINOv3 backbone on remote sensing data yielded significantly better segmentation performance than the frozen-backbone CAFe-DINO on the same datasets would show the claim does not hold.
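Any such head-to-head comparison would be scored with mean intersection-over-union, the standard metric on these benchmarks. A minimal mIoU routine (the generic metric, not the authors' evaluation code) looks like:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes that appear in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

gt   = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 2]])
print(round(mean_iou(pred, gt, 3), 3))  # 0.722
```

Running both the frozen and fine-tuned variants through the same routine on Potsdam, Vaihingen, LoveDA, and OEM would settle whether the frozen latents carry the result.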

Figures

Figures reproduced from arXiv: 2605.03175 by Ryan Faulkenberry, Saurabh Prasad.

Figure 1. Open-vocabulary segmentation maps from our model, CAFe-DINO. CAFe-DINO can accurately segment remote sensing scenes.
Figure 2. DINOv3.txt alone is not a strong OVSS model for remote sensing.
Figure 3. Details of the CAFe-DINO architecture.
Figure 4. Quantitative results for CAFe-DINO and other OVSS methods. Our method is remarkably accurate on urban scenes, but …
Figure 5. DINOv3.txt cost maps for each of the Potsdam classes before (top row) and after (bottom row) CAFe-DINO aggregation.
Figure 6. Cost maps for a Potsdam image (columns: Road, Building, Low Veg., Tree, Car, Prediction; rows: CAFe-DINO, DINOv3, True Mask).
Figure 7. Cost maps for a Vaihingen image.
Figure 8. Cost maps for a LoveDA image (columns: Barren, Grass, Pavement, Road, Tree, Water, Cropland, Building, Prediction; rows: CAFe-DINO, DINOv3, True Mask).
Figure 9. Cost maps for an OEM image.
read the original abstract

The remote sensing (RS) domain suffers from a lack of densely labeled datasets, which are costly to obtain. Thus, models that can segment RS imagery well without supervised fine-tuning are valuable, but existing solutions fall behind supervised methods. Recently, DINOv3 surpassed SOTA RS foundation models on the GEO-bench segmentation benchmark without pre-training on RS data. Additionally, DINO.txt has enabled open vocabulary semantic segmentation (OVSS) with the DINOv3 backbone. We leverage these developments to form an OVSS model for RS imagery, free of RS-domain fine-tuning. Our model, CAFe-DINO (Cost Aggregation + Feature Upsampling with DINO) exploits the strong OVSS performance of DINOv3 for RS imagery via cost aggregation and training-free upsampling of text-image similarity scores. The robust latent of the DINOv3 backbone eliminates the need for fine-tuning on RS imagery; we instead fine-tune our model on a RS-targeted subset of COCO-Stuff. CAFe-DINO achieves state-of-the-art performance on key RS segmentation datasets, outperforming OVSS methods fine-tuned on RS data. Our code and data are publicly available at https://github.com/rfaulk/DINO_Soars.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces CAFe-DINO, an open-vocabulary semantic segmentation (OVSS) model for remote sensing (RS) imagery that uses the DINOv3 backbone without any RS-domain pre-training or backbone fine-tuning. It combines cost aggregation with training-free upsampling of text-image similarity scores, fine-tunes only the CAFe components on an RS-targeted COCO-Stuff subset, and reports state-of-the-art performance on key RS segmentation benchmarks, outperforming existing OVSS methods that were fine-tuned on RS data. Code and data are released publicly.

Significance. If the results hold, the work demonstrates that natural-image pre-trained foundation models can transfer effectively to RS OVSS with minimal adaptation, addressing the scarcity of densely labeled RS data. The public code and data constitute a clear strength for reproducibility. The result would be notable if it can be shown that gains derive from the DINOv3 latents rather than solely from the aggregation/upsampling modules.

major comments (1)
  1. [Abstract, §3 Methods] The central claim that 'the robust latent of the DINOv3 backbone eliminates the need for fine-tuning on RS imagery' and enables SOTA results is load-bearing but rests on an untested assumption about domain robustness. RS imagery exhibits top-down geometry, multi-spectral statistics, and extreme scale variation outside DINOv3's natural-image pre-training distribution. An ablation comparing the reported no-backbone-fine-tuning setup against a version with RS backbone fine-tuning (or against RS-pretrained backbones) is required to confirm that the outperformance is not attributable only to the cost-aggregation and upsampling tricks.
minor comments (2)
  1. [Abstract] The claim of 'state-of-the-art performance on key RS segmentation datasets' is stated without naming the datasets or reporting any quantitative metrics or baselines; this should be expanded for immediate clarity.
  2. [§4 Experiments] The paper should explicitly list the exact RS benchmarks used, the competing OVSS methods (including whether their backbones were RS-fine-tuned), and error bars or statistical significance tests to support the SOTA claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and recognition of the work's potential significance for RS OVSS with minimal adaptation. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract, §3 Methods] The central claim that 'the robust latent of the DINOv3 backbone eliminates the need for fine-tuning on RS imagery' and enables SOTA results is load-bearing but rests on an untested assumption about domain robustness. RS imagery exhibits top-down geometry, multi-spectral statistics, and extreme scale variation outside DINOv3's natural-image pre-training distribution. An ablation comparing the reported no-backbone-fine-tuning setup against a version with RS backbone fine-tuning (or against RS-pretrained backbones) is required to confirm that the outperformance is not attributable only to the cost-aggregation and upsampling tricks.

    Authors: We agree that an explicit ablation would strengthen the central claim. Our work is motivated by DINOv3's prior results on the GEO-bench RS segmentation benchmark, where it outperformed RS-pretrained models without domain fine-tuning. However, we acknowledge this does not directly isolate contributions in the OVSS setting with CAFe. In the revised manuscript we will add an ablation that fine-tunes the DINOv3 backbone on RS data and compares performance against the frozen-backbone version, to establish whether gains derive from the pre-trained latents rather than the aggregation/upsampling modules alone. We will also note that experiments use RGB RS imagery to match DINOv3's input distribution.
    revision: yes

Circularity Check

0 steps flagged

No circularity; empirical application of external DINOv3 backbone with independent benchmarks

full rationale

The paper's chain consists of citing DINOv3's prior GEO-bench results (external to this work), constructing CAFe-DINO via cost aggregation and upsampling on top of the frozen backbone, fine-tuning only the added components on an RS-targeted COCO-Stuff subset, and reporting experimental SOTA numbers on RS segmentation datasets. No equation or claim reduces by construction to its own inputs; the robustness premise is supported by the cited external benchmark rather than self-citation or redefinition, and public code enables independent verification. This is a standard empirical transfer-learning application with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of DINOv3 features to the RS domain and the effectiveness of cost aggregation for OVSS without additional RS pretraining.

axioms (1)
  • domain assumption: DINOv3 latent representations are robust and transferable to remote sensing imagery without domain-specific fine-tuning of the backbone.
    Stated explicitly in the abstract as eliminating the need for RS fine-tuning.

pith-pipeline@v0.9.0 · 5524 in / 1119 out tokens · 46307 ms · 2026-05-08T18:18:41.243023+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 4 canonical work pages

  1. [1] 2D Semantic Labeling. https://www.isprs.org/.
  2. [2] Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3689–3698, 2024.
  3. [3] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference On. IEEE, 2018.
  4. [4] Qinglong Cao, Yuntian Chen, Chao Ma, and Xiaokang Yang. Open-vocabulary high-resolution remote sensing image semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing, 63:1–14, 2025.
  5. [5] Jia-Ren Chang and Yong-Sheng Chen. Pyramid Stereo Matching Network. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5410–5418, 2018.
  6. [6] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
  7. [7] Seokju Cho, Sunghwan Hong, and Seungryong Kim. CATs++: Boosting cost aggregation with convolutions and transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7174–7194, 2023.
  8. [8] Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4113–4123, Seattle, WA, USA, 2024. IEEE.
  9. [9] Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, and Nicolas Thome. JAFAR: Jack up any feature at any resolution, 2025.
  10. [10] Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling Zero-Shot Semantic Segmentation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11573–11582, 2022.
  11. [11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2020.
  12. [12] Saikat Dutta, Akhil Vasim, Siddhant Gole, Hamid Rezatofighi, and Biplab Banerjee. AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2245–2255, Nashville, TN, USA, 2025. IEEE.
  13. [13] Stephanie Fu, Mark Hamilton, Laura E. Brandt, Axel Feldmann, Zhoutong Zhang, and William T. Freeman. FeatUp: A model-agnostic framework for features at any resolution. In The Twelfth International Conference on Learning Representations, 2024.
  14. [14] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling Open-Vocabulary Image Segmentation with Image-Level Labels. In Computer Vision – ECCV 2022, pages 540–557. Springer Nature Switzerland, Cham, 2022.
  15. [15] Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang, and Hongsheng Li. Group-Wise Correlation Stereo Network. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3268–3277, Long Beach, CA, USA, 2019. IEEE.
  16. [16] Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, Oriane Siméoni, Huy V. Vo, Patrick Labatut, and Piotr Bojanowski. DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment, 2024.
  17. [17] Dahyun Kang and Minsu Cho. In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLI, pages 143–164, Berlin, Heidelberg.
  18. [18] Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan David Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, Mehmet Gunturkun, Gabriel Huang, David Vazquez, Dava Newman, Yoshua Bengio, Stefano Ermon, and Xiao Xiang Zhu. GEO-bench: Toward foundation models for earth monitoring. In Proceedings of the 37th …
  19. [19] Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. ProxyCLIP: Proxy attention improves CLIP for open-vocabulary segmentation. In European Conference on Computer Vision, pages 70–88. Springer, 2024.
  20. [20] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic segmentation. In International Conference on Learning Representations, 2022.
  21. [21] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML.
  22. [22] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, pages 19730–19742, Honolulu, Hawaii, USA.
  23. [23] Kaiyu Li, Ruixun Liu, Xiangyong Cao, Xueru Bai, Feng Zhou, Deyu Meng, and Zhi Wang. SegEarth-OV: Towards training-free open-vocabulary segmentation for remote sensing images. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10545–10556, 2025.
  24. [24] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7070, 2023.
  25. [25] Yong Liu, Sule Bai, Guanbin Li, Yitong Wang, and Yansong Tang. Open-Vocabulary Segmentation with Semantic-Assisted Calibration. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3491–3500, 2024.
  26. [26] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002, Montreal, QC, Canada.
  27. [27] Jishnu Mukhoti, Tsung-Yu Lin, Omid Poursaeed, Rui Wang, Ashish Shah, Philip H.S. Torr, and Ser-Nam Lim. Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19413–19423, 2023.
  28. [28] Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and I. Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning, 2021.
  29. [29] Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation. In Computer Vision – ECCV 2024, pages 139–156, Cham, 2025. Springer Nature Switzerland.
  30. [30] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie…
  31. [31] Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, and Siyang Li. CLIP as RNN: Segment countless visual concepts without training endeavor. In CVPR, 2024.
  32. [32] Saksham Suri, Matthew Walmer, Kamal Gupta, and Abhinav Shrivastava. LiFT: A surprisingly simple lightweight feature transform for dense ViT descriptors. In European Conference on Computer Vision, pages 110–128. Springer, 2025.
  33. [33] Daniela Szwarcman, Sujit Roy, Paolo Fraccaro, Þorsteinn Elí Gíslason, Benedikt Blumenstiel, Rinki Ghosal, Pedro Henrique de Oliveira, João Lucas de Sousa Almeida, Rocco Sedona, Yanghui Kang, Srija Chakraborty, Sizhe Wang, Ankur Kumar, Myscon Truong, Denys Godwin, Hyunho Lee, Chia-Yu Hsu, Ata Akbari Asanjan, Besart Mujeci, Trevor Keenan, Paulo Arévolo, W… arXiv preprint arXiv:2412.02732, 2024.
  34. [34] Jamie Tolan, Hung-I Yang, Benjamin Nosarzewski, Guillaume Couairon, Huy V. Vo, John Brandt, Justine Spore, Sayantan Majumdar, Daniel Haziza, Janaki Vamaraju, Theo Moutakanni, Piotr Bojanowski, Tracy Johns, Brian White, Tobias Tiecke, and Camille Couprie. Very high resolution canopy height maps from RGB imagery using self-supervised vision transform…
  35. [35] Tobias Uelwer, Jan Robine, Stefan Sylvius Wagner, Marc Höftmann, Eric Upschulte, Sebastian Konietzny, Maike Behrendt, and Stefan Harmeling. A survey on self-supervised methods for visual representation learning. Machine Learning, 114(4):111, 2025.
  36. [36] Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation.
  37. [37] Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, and Dong Xu. Diffusion model is secretly a training-free open vocabulary semantic segmenter. arXiv preprint arXiv:2309.02773, 2023.
  38. [38] Wenzhen Wang, Aoran Xiao, Wei He, Hongyuan Zhu, and Liang Xiao. Text-to-image activation for open-vocabulary semantic segmentation in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 63:1–17, 2025.
  39. [39] Thomas Wimmer, Prune Truong, Marie-Julie Rakotosaona, Michael Oechsle, Federico Tombari, Bernt Schiele, and Jan Eric Lenssen. AnyUp: Universal feature upsampling. arXiv preprint arXiv:2510.12764, 2025.
  40. [40] Junshi Xia, Naoto Yokoya, Bruno Adriano, and Clifford Broni-Bediako. OpenEarthMap: A Benchmark Dataset for Global High-Resolution Land Cover Mapping. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6243–6253, 2023.
  41. [41] Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired foundation model for observing the Earth crossing modalities. arXiv preprint arXiv:2403.15356, 2024.
  42. [42] Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J. Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation, 2025.
  43. [43] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. SAN: Side adapter network for open-vocabulary semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15546–15561, 2023.
  44. [44] Chengyang Ye, Yunzhi Zhuge, and Pingping Zhang. Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9):9436–9444, 2025.
  45. [45] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-Shot Transfer with Locked-image text Tuning. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18102–18112, 2022.
  46. [46] Shijie Zhang, Bin Zhang, Yuntao Wu, Huabing Zhou, Junjun Jiang, and Jiayi Ma. SegCLIP: Multimodal Visual-Language and Prompt Learning for High-Resolution Remote Sensing Semantic Segmentation. IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024.
  47. [47] Chaoyang Zhu and Long Chen. A survey on open-vocabulary detection and segmentation: Past, present, and future. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):8954–8975, 2024.
  48. [48] Supplementary material, Prompt Details: "Every dataset contains a named set of classes. Instead of using these names directly, we have found it beneficial to make some logical modifications to the existing names, which we list here in Tab. 7 for reproducibility. Our heuristic in making substitutions is to target classes with complex or abstract semantic names (e.g. 'agriculture' …)"
  49. [49] Supplementary material, COCO-Stuff Subset: "We use a subset of the COCO-Stuff dataset containing only the following classes: bicycle, car, motorcycle, airplane, bus, train, truck, boat, bridge, building, bush, dirt, fence, grass, gravel, ground, hill, house, leaves, metal, mountain, mud, pavement, plant, platform, playing field, railing, railroad, river, road, rock, roof, …"
  50. [50] Supplementary material, Additional Cost Maps: "On the following pages, we provide additional cost maps for samples from the Potsdam (Fig. 6), Vaihingen (Fig. 7), LoveDA (Fig. 8), and OEM (Fig. 9) datasets."