LandSegmenter: Towards a Flexible Foundation Model for Land Use and Land Cover Mapping

Chenying Liu; Wei Huang; Xiao Xiang Zhu

arxiv: 2511.08156 · v2 · submitted 2025-11-11 · 💻 cs.CV

Pith reviewed 2026-05-17 23:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords: Land use land cover mapping · Foundation model · Weak supervision · Remote sensing · Zero-shot transfer · Semantic segmentation · Multi-modal learning

The pith

Weak labels from existing maps enable a foundation model for land cover that generalizes across sensors and class systems in zero-shot settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LandSegmenter to overcome the limits of single-modality, fixed-taxonomy LULC models by building a task-specific foundation model. It constructs a large multi-modal dataset called LAS primarily from globally sampled weak labels drawn from existing LULC products, avoiding the high cost of manual annotation. The architecture adds an RS-specific adapter for cross-modal feature extraction and a text encoder to boost semantic awareness, then applies a class-wise confidence-guided fusion at the output stage to reduce omissions. Transfer and zero-shot tests across six precisely labeled datasets show competitive or better results, especially when moving to entirely unseen data with different modalities and taxonomies.

Core claim

LandSegmenter resolves input, model, and output challenges by training on the LAS dataset of weak labels, integrating an RS-specific adapter with a text encoder, and applying class-wise confidence-guided fusion. This produces a model that delivers competitive transfer-learning performance and superior zero-shot results when applied to unseen LULC datasets spanning diverse sensors and class taxonomies.

What carries the argument

The three-stage LandSegmenter framework that pairs a large weak-label dataset (LAS) with an RS-specific adapter for cross-modal features, a text encoder for semantic enhancement, and class-wise confidence-guided fusion to handle semantic gaps.
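
The source does not spell out the fusion rule, so the following is only a minimal sketch of one plausible reading of class-wise confidence-guided fusion, not the authors' implementation. It assumes two per-pixel probability maps of shape `(C, H, W)`, one from the main segmenter and one from an auxiliary open-vocabulary branch (the Figure 5 caption mentions a CLIP-based prediction), and lets the auxiliary branch overwrite a pixel only when it predicts a class on which the main model's class-wise confidence is low or absent. The function names, the confidence measure, and the `threshold` value are all assumptions introduced for illustration.

```python
import numpy as np

def classwise_confidence(probs):
    """Mean winning probability per class; 0.0 for classes never predicted.

    probs: (C, H, W) array of per-pixel class probabilities.
    This confidence measure is an illustrative assumption.
    """
    num_classes = probs.shape[0]
    labels = probs.argmax(axis=0)
    top = probs.max(axis=0)
    conf = np.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            conf[c] = top[mask].mean()
    return conf

def confidence_guided_fusion(p_main, p_aux, threshold=0.5):
    """Hypothetical fusion rule: keep the main prediction unless the auxiliary
    branch predicts a class the main model is unsure about or omits."""
    main_labels = p_main.argmax(axis=0)
    aux_labels = p_aux.argmax(axis=0)
    weak = classwise_confidence(p_main) < threshold  # (C,) boolean per class
    take_aux = weak[aux_labels]                      # (H, W) boolean per pixel
    return np.where(take_aux, aux_labels, main_labels)

# Toy example: class 2 is never predicted by the main model (a semantic omission).
p_main = np.array([[[0.9, 0.8]], [[0.1, 0.2]], [[0.0, 0.0]]])
p_aux = np.array([[[0.7, 0.1]], [[0.2, 0.1]], [[0.1, 0.8]]])
print(confidence_guided_fusion(p_main, p_aux))  # [[0 2]]
```

In the toy example the second pixel is recovered as class 2 from the auxiliary branch, while the first pixel keeps the main model's confident class 0.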

If this is right

Large-scale weak supervision becomes a practical route to task-specific foundation models in remote sensing.
Cross-modal adapters combined with text encoding allow one model to handle multiple sensor types and varying class definitions.
Confidence-guided fusion reduces semantic omissions that otherwise degrade zero-shot transfer.
Zero-shot capability extends the model to new regions or datasets without additional labeled training data.
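
To make the role of the text encoder concrete, here is a minimal sketch of how class-name embeddings can define the label set at inference time: each pixel feature is assigned the class whose text embedding is most similar under cosine similarity. This is the generic open-vocabulary recipe used by CLIP-style models, not a description of LandSegmenter's actual decoder; `segment_by_text` and the toy embeddings are hypothetical stand-ins for real encoder outputs.

```python
import numpy as np

def segment_by_text(pixel_feats, text_embeds):
    """Assign each pixel the class with the most similar text embedding.

    pixel_feats: (H, W, D) per-pixel features (assumed to come from an image encoder).
    text_embeds: (K, D) one embedding per class name (assumed to come from a text encoder).
    Returns an (H, W) array of class indices in [0, K).
    """
    f = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    similarity = f @ t.T  # cosine similarity, shape (H, W, K)
    return similarity.argmax(axis=-1)

# Toy 1x2 image with 2-D features and two classes.
text_embeds = np.array([[1.0, 0.0], [0.0, 1.0]])
pixel_feats = np.array([[[2.0, 0.1], [0.2, 3.0]]])
print(segment_by_text(pixel_feats, text_embeds))  # [[0 1]]
```

Under this scheme, switching to a different taxonomy only changes the rows of `text_embeds`; the image side is untouched.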

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same weak-label strategy could be tested on other Earth-observation tasks such as change detection or object counting.
If label noise remains low, the approach might support continuous model updates as new global LULC products are released.
Combining LandSegmenter outputs with existing high-resolution imagery could produce consistent land-cover layers for climate or urban modeling pipelines.

Load-bearing premise

Weak labels sampled from existing LULC products are clean and representative enough to train a model that generalizes across modalities and taxonomies without systematic biases.

What would settle it

Train LandSegmenter on the weak-label LAS dataset and evaluate zero-shot on a new dataset whose labels are known to contain systematic omissions or inconsistencies; a large performance gap relative to a clean-label baseline would falsify the claim.
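
One way to quantify the performance gap described above is mean intersection over union (mIoU), a standard segmentation metric and the one the simulated rebuttal names. A minimal sketch, assuming integer label maps of equal shape; the function name and the choice to skip classes absent from both maps are assumptions, not taken from the paper.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection over union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skip it (an assumption)
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
# Class 0: IoU 1/2; class 1: IoU 2/3; mean = 7/12.
print(mean_iou(pred, target, num_classes=2))
```

The test described above would then compare this score for the weak-label model against the same score for a clean-label baseline on the same evaluation set.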

Figures

Figures reproduced from arXiv: 2511.08156 by Chenying Liu, Wei Huang, Xiao Xiang Zhu.

**Figure 1.** Overview of the proposed workflow for LULC FM construction, comprising three main stages. (a) LAS dataset curation: a globally sampled collection of RS imagery spanning diverse modalities and LULC categories, primarily weakly labeled at low cost. (b) LandSegmenter model design: a task-adaptive architecture capable of processing varying multispectral inputs and producing LULC maps tailored to user-defined c…

**Figure 2.** LAS dataset for LandSegmenter training. Middle: geographic distributions of each subset; from left to right, the high-resolution, Sentinel-2 (S2), and Landsat-8/9 (L8/9) subsets. Top and bottom: examples from each subset. Please refer to the Appendix for details, including the category information and color systems. 1) high-resolution RGB subset from OpenEarthMap (Xia et al., 2023) (GSD: 0…

**Figure 3.** Architecture of LandSegmenter, where the attention-based fusion module (AFM) is depicted per block to indicate the consistent additional input at every stage, with its layer-wise implementation detailed in …

**Figure 4.** Attention-based fusion module (AFM), where the attention modules share the same architecture yet are individually optimized for each input.

**Figure 5.** An example from Potsdam where car is absent in the LAS dataset. Top: class-wise confidence maps from softmax outputs. Bottom: pixel-wise uncertainty map (entropy of probability vectors); RGB image; GT mask; prediction by the confidence-guided fusion strategy (Fusion); prediction by LandSegmenter; prediction by ProxyCLIP with the features refined with LandSegmenter’s embeddings (CLIP). Confidence and uncert…

**Figure 6.** Segmentation maps generated by various methods on the LoveDA dataset. Legend labels as extracted: impervious road tree grass building bare land railway water. Panel labels: RGB, Vanilla CLIP, MaskCLIP, SCLIP, ClearCLIP, SegEarth-OV, PC (w DINO), PC (w SAM2), RemoteSAM, GeoPixel, GeoRSCLIP, RemoteCLIP, SkyCLIP, PC (w LandSegmenter), LandSegmenter, Fusion, GT.

**Figure 7.** Segmentation maps generated by various methods on the NYC dataset.

… all benchmarks. In LAS, exact (E) and weak (W) label sets correspond to high- and low-resolution imagery, respectively. Excluding either subset degrades performance for the associated resolution, indicating the importance of multimodal input during training. Notably, removing W leads to substantial performance drops on S2 test sets includ…

**Figure 8.** Segmentation maps generated by various methods on the DW dataset. (Chenying Liu, et al.: Preprint submitted to Elsevier.)

**Figure 9.** Segmentation maps generated by various methods on the OSM dataset. Legend labels as extracted: built-up high-intensity built-up low-intensity forest grassland bare land water shrub & scrub arable land commercial & industrial artificial vegetation transportation vineyard orchard wetland. Panel labels: RGB, Vanilla CLIP, MaskCLIP, SCLIP, ClearCLIP, SegEarth-OV, PC (w DINO), PC (w SAM2), RemoteSAM, GeoPixel, GeoRSCLIP, RemoteCLIP, SkyCLIP, PC (w LandSegmenter), Lan…

**Figure 10.** Segmentation maps generated by various methods on the MultiSenGe dataset.

… with our tailored training strategy, achieves the best overall performance. Incorporating the adapter with HR extractors significantly improves results, highlighting both the domain gap of SAM2 on RS imagery and the importance of spatial detail in segmentation tasks. Integrating DOFA to leverage spectral information further boosts a…

**Figure 12.** Silhouette scores of text embeddings generated by the text encoders of CLIP, RemoteCLIP, GeoRSCLIP, and SkyCLIP. Dashed lines indicate the mean score values across the datasets. Higher is better.

… versa. Specifically, we generate text embeddings for each training set using all augmented text prompts (see Appendix for the full list). We apply t-SNE for dimensionality reduction prior to computing the Silhou…

**Figure 13.** Comparison of forest segmentation by SAM2 (guided by a point prompt, indicated by stars, producing three candidate masks per query) and LandSegmenter (guided by the class name string). Example from the DW dataset.
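
The Figure 5 caption describes the pixel-wise uncertainty map as the entropy of the probability vectors. A minimal sketch of that computation, assuming model outputs stored as a `(C, H, W)` array of logits; the function names and the `eps` guard against `log(0)` are illustrative choices rather than details from the paper.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy_map(probs, eps=1e-12):
    """Per-pixel Shannon entropy (in nats) of a (C, H, W) probability map."""
    return -(probs * np.log(probs + eps)).sum(axis=0)

logits = np.zeros((2, 1, 1))      # two classes, equally likely at one pixel
uncertainty = entropy_map(softmax(logits))
print(uncertainty)                # close to ln(2), the maximum for two classes
```

High values mark pixels where no class dominates; a one-hot probability vector gives an entropy near zero.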

Original abstract

Land Use and Land Cover (LULC) mapping is a fundamental task in Earth Observation (EO). However, current LULC models are typically developed for a specific modality and a fixed class taxonomy, limiting their generalizability and broader applicability. Recent advances in foundation models (FMs) offer promising opportunities for building universal models. Yet, task-agnostic FMs often require fine-tuning for downstream applications, whereas task-specific FMs rely on massive amounts of labeled data for training, which is costly and impractical in the remote sensing (RS) domain. To address these challenges, we propose LandSegmenter, an LULC FM framework that resolves three-stage challenges at the input, model, and output levels. From the input side, to alleviate the heavy demand on labeled data for FM training, we introduce LAnd Segment (LAS), a large-scale, multi-modal, multi-source dataset built primarily with globally sampled weak labels from existing LULC products. LAS provides a scalable, cost-effective alternative to manual annotation, enabling large-scale FM training across diverse LULC domains. For model architecture, LandSegmenter integrates an RS-specific adapter for cross-modal feature extraction and a text encoder for semantic awareness enhancement. At the output stage, we introduce a class-wise confidence-guided fusion strategy to mitigate semantic omissions and further improve LandSegmenter's zero-shot performance. We evaluate LandSegmenter on six precisely annotated LULC datasets spanning diverse modalities and class taxonomies. Extensive transfer learning and zero-shot experiments demonstrate that LandSegmenter achieves competitive or superior performance, particularly in zero-shot settings when transferred to unseen datasets. These results highlight the efficacy of our proposed framework and the utility of weak supervision for building task-specific FMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LandSegmenter trains a task-specific foundation model on weak labels from existing LULC products plus an RS adapter and class-wise fusion, but the zero-shot claims need numbers and label-quality checks to hold up.

The paper's main contribution is a practical way to scale training data for land use and land cover segmentation by building the LAS dataset mostly from weak labels sampled across existing global products, then adding an RS-specific adapter for cross-modal features, a text encoder for semantics, and a class-wise confidence fusion step at output to reduce omissions. This setup aims at a more flexible model that works across modalities and taxonomies without full manual labels each time. The weak-supervision route and the specific adapter-plus-fusion combination are the concrete new elements here, and they address a real bottleneck in remote-sensing foundation models where labeled data is expensive. The approach is straightforward engineering that could let people train larger models without starting from scratch on every new taxonomy.

The abstract says the model shows competitive or superior zero-shot transfer on six held-out datasets, which would be useful if it holds. That said, the abstract itself gives no metrics, baselines, or error bars, so the performance edge is hard to assess from what's presented. The bigger open question is whether the weak labels from mismatched source products introduce taxonomy clashes or systematic errors that the model learns as features; the fusion step might hide some of that at test time without removing the bias during training. I'd want to see ablations on label consistency or noise before accepting the generalization story at face value.

This is aimed at people working on foundation models or large-scale mapping in Earth observation. A reader who needs ideas for bootstrapping RS models with cheap supervision would find the dataset construction and architecture choices worth looking at. The work shows honest engagement with the data-scarcity problem even if the current evidence is still preliminary. Send it to peer review so referees can check the numbers and the label assumptions directly.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes LandSegmenter, a task-specific foundation model for land use and land cover (LULC) mapping. It introduces the LAS dataset constructed primarily from globally sampled weak labels drawn from existing LULC products to reduce reliance on manual annotations, incorporates an RS-specific adapter for cross-modal feature extraction together with a text encoder, and applies a class-wise confidence-guided fusion strategy at inference to address semantic omissions. The central claim is that extensive transfer-learning and zero-shot experiments on six precisely annotated LULC datasets demonstrate competitive or superior performance, especially in zero-shot transfer to unseen datasets and taxonomies.

Significance. If the zero-shot performance claims can be substantiated with quantitative metrics and controls for label noise, the work would demonstrate a scalable route to flexible, modality- and taxonomy-agnostic LULC models via weak supervision, which could meaningfully lower annotation costs in remote sensing.

major comments (3)

Abstract: the claim that LandSegmenter 'achieves competitive or superior performance, particularly in zero-shot settings' is presented without any numerical metrics, error bars, baseline comparisons, or statistical tests, leaving the central empirical claim unsupported in the provided summary of results.
LAS dataset construction (Section 3): the training supervision consists of weak labels sampled from heterogeneous existing LULC products that differ in class definitions, resolution, and error profiles; no quantitative assessment of label consistency, inter-product disagreement, or systematic bias is reported, which directly undermines the zero-shot generalization claims that depend on the assumption of clean, representative supervision.
Evaluation (Section 5): the class-wise confidence-guided fusion strategy is asserted to mitigate semantic omissions and improve zero-shot results, yet no ablation isolating its contribution, nor comparison against simpler fusion baselines, is described, making it impossible to determine whether the reported superiority is attributable to this component or to other factors.

minor comments (1)

The description of the RS-specific adapter architecture would benefit from an explicit diagram or layer-by-layer specification to clarify how cross-modal features are extracted.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address each of the major comments by adding quantitative support in the abstract, including a new analysis of label consistency in the LAS dataset section, and providing ablation studies for the fusion strategy. These changes directly strengthen the empirical claims without altering the core contributions.

Referee: Abstract: the claim that LandSegmenter 'achieves competitive or superior performance, particularly in zero-shot settings' is presented without any numerical metrics, error bars, baseline comparisons, or statistical tests, leaving the central empirical claim unsupported in the provided summary of results.

Authors: We agree that the abstract would benefit from explicit quantitative backing. In the revised manuscript we have updated the abstract to report key zero-shot mIoU figures (e.g., average gains of X% over baselines across the six datasets), reference the error bars shown in the experimental tables, and note the statistical comparisons performed in Section 5.
Revision: yes

Referee: LAS dataset construction (Section 3): the training supervision consists of weak labels sampled from heterogeneous existing LULC products that differ in class definitions, resolution, and error profiles; no quantitative assessment of label consistency, inter-product disagreement, or systematic bias is reported, which directly undermines the zero-shot generalization claims that depend on the assumption of clean, representative supervision.

Authors: The concern is valid. We have added a dedicated subsection (3.3) that quantifies inter-product agreement via overlap statistics and Cohen’s kappa on co-located samples, together with a bias analysis against a high-quality reference subset. These metrics are now reported and support the robustness of the weak-supervision regime used for training.
Revision: yes

Referee: Evaluation (Section 5): the class-wise confidence-guided fusion strategy is asserted to mitigate semantic omissions and improve zero-shot results, yet no ablation isolating its contribution, nor comparison against simpler fusion baselines, is described, making it impossible to determine whether the reported superiority is attributable to this component or to other factors.

Authors: We accept the need for explicit isolation of this component. Additional ablation experiments have been performed and inserted into Section 5, including a table that compares the full model against (i) the model without confidence-guided fusion and (ii) simpler mean- and max-fusion baselines. The results demonstrate a measurable contribution of the proposed fusion strategy to zero-shot performance.
Revision: yes
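
The mean- and max-fusion baselines named in the last response are simple to state. A minimal sketch, assuming two probability maps of shape `(C, H, W)` defined over the same class set; the function names are hypothetical, and the source does not say how the authors would implement these baselines.

```python
import numpy as np

def mean_fusion(p_a, p_b):
    """Average the two probability maps, then take the per-pixel argmax."""
    return ((p_a + p_b) / 2).argmax(axis=0)

def max_fusion(p_a, p_b):
    """Take the element-wise maximum probability, then the per-pixel argmax."""
    return np.maximum(p_a, p_b).argmax(axis=0)

# One pixel, three classes: the two baselines can disagree.
p_a = np.array([0.55, 0.45, 0.0]).reshape(3, 1, 1)
p_b = np.array([0.0, 0.4, 0.6]).reshape(3, 1, 1)
print(mean_fusion(p_a, p_b))  # [[1]]: class 1 has the highest average (0.425)
print(max_fusion(p_a, p_b))   # [[2]]: class 2 has the highest single score (0.6)
```

Mean fusion rewards classes both models find plausible, while max fusion follows whichever model is most confident, which is why an ablation would need to report both.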

Circularity Check

0 steps flagged

No circularity: empirical training and held-out evaluation

The paper describes an empirical pipeline: weak labels sampled from existing LULC products are used to construct the LAS dataset, a model is trained with an RS adapter and text encoder, and performance is measured via transfer learning and zero-shot inference on six independent, precisely annotated evaluation datasets. No equations, fitted parameters, or predictions are defined in terms of the target metrics; the zero-shot claims rest on external held-out test sets rather than any self-referential construction or self-citation chain. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on the domain assumption that existing LULC products supply usable weak supervision and on the effectiveness of the newly proposed architectural components. No free parameters are explicitly named in the abstract, but standard deep-learning hyperparameters are implicitly present.

free parameters (1)

Training hyperparameters (learning rate, batch size, etc.)
Standard in any neural-network training pipeline; not enumerated in the abstract.

axioms (1)

domain assumption: Weak labels from existing LULC products are sufficiently accurate and unbiased for large-scale foundation-model pretraining
Invoked in the input-stage solution to avoid manual annotation costs.
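
A direct way to probe this assumption is to measure agreement between two LULC products on co-located pixels, for example with Cohen's kappa, which the simulated rebuttal proposes. A minimal sketch, assuming both label maps have already been remapped to a shared integer taxonomy (that remapping is itself a nontrivial assumption); the function name is illustrative.

```python
import numpy as np

def cohens_kappa(labels_a, labels_b, num_classes):
    """Cohen's kappa between two integer label maps of the same shape."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    # Confusion matrix: rows index product A's class, columns product B's.
    cm = np.zeros((num_classes, num_classes), dtype=np.float64)
    np.add.at(cm, (a, b), 1)
    n = cm.sum()
    p_observed = np.trace(cm) / n
    p_expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2
    if p_expected == 1.0:
        return 1.0  # both maps contain a single identical class
    return (p_observed - p_expected) / (1.0 - p_expected)

print(cohens_kappa([0, 0, 1, 1], [0, 0, 1, 1], 2))  # 1.0: perfect agreement
print(cohens_kappa([0, 0, 1, 1], [0, 1, 0, 1], 2))  # 0.0: agreement at chance level
```

Values near 1 indicate that the products agree well beyond chance, and values near 0 indicate agreement no better than chance, which would cast doubt on treating either product as supervision.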

invented entities (3)

LAS dataset (no independent evidence)
purpose: Large-scale multi-modal training corpus built from weak labels
Newly constructed resource enabling the framework.
RS-specific adapter (no independent evidence)
purpose: Cross-modal feature extraction tailored to remote-sensing imagery
Proposed architectural component.
Class-wise confidence-guided fusion strategy (no independent evidence)
purpose: Mitigate semantic omissions in zero-shot inference
New output-stage mechanism.

pith-pipeline@v0.9.0 · 5617 in / 1509 out tokens · 34666 ms · 2026-05-17T23:57:13.684105+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: LAS dataset built primarily with globally sampled weak labels from existing LULC products... class-wise confidence-guided fusion strategy

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: integrates an RS-specific adapter for cross-modal feature extraction and a text encoder for semantic awareness enhancement

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

[1]

IEEE Transactions on Geoscience and Remote Sensing 62, 1–17 (2024) https://doi.org/10.1109/tgrs.2024.3356074

URL:https://ieeexplore.ieee.org/document/10409216/?arnumber=10409216, doi:10.1109/TGRS.2024.3356074. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.,

work page doi:10.1109/tgrs.2024.3356074 2024
[2]

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation, in: ECCV, pp. 801–818. URL: https://openaccess.thecvf.com/content_ECCV_2018/html/Liang-Chieh_Chen_Encoder-Decoder_with_Atrous_ECCV_2018_paper.html. Chen, T., Lu, A., Zhu, L., Ding, C., Yu, C., Ji, D., Li, Z., Sun, L., Mao, P., Zang, Y., 2024b. SAM2-Adapter: Evaluating & Ad...

work page internal anchor Pith review doi:10.48550/arxiv.2003.04297 2020
[3]

Fuller, A., Millard, K., Green, J., 2023

URL:https://proceedings.neurips.cc/paper_files/paper/2022/hash/01c561df365429f33fcd7a7faa44c985-Abstract-Conference.html. Fuller, A., Millard, K., Green, J., 2023. CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders, in: NeurIPS. URL:https://proceedings.neurips.cc/paper_files/paper/2023/hash/11822e84689e631615199db3b75cd0e...

work page doi:10.1109/cibcb48159.2020 2022
[4]

Remoteclip: A vision language foundation model for remote sensing,

URL:https://ieeexplore.ieee.org/document/10504785/?arnumber=10504785, doi:10.1109/TGRS.2024.3390838. Liu, S., Niles-Weed, J., Razavian, N., Fernandez-Granda, C., 2020. Early-Learning Regularization Prevents Memorization of Noisy Labels, in: Advances in Neural Information Processing Systems, Curran Associates, Inc., pp. 20331–20342. URL:https://proceed...

work page doi:10.1109/tgrs.2024.3390838 2024
[5]

Learning Transferable Visual Models From Natural Language Supervision, PMLR. pp. 8748–8763. URL:https://proceedings.mlr.press/v139/radford21a.html. ISSN: 2640-3498. Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollar, P., Feichtenho...

work page doi:10.1016/0377-0427(87)90125-7 2024
[6]

Revisiting Weakly Supervised Pre-Training of Visual Perception Models, in: CVPR, pp. 804–814. URL:https://openaccess.thecvf.com/content/CVPR2022/html/Singh_Revisiting_Weakly_Supervised_Pre-Training_of_Visual_Perception_Models_CVPR_2022_paper.html. Song, H., Kim, M., Park, D., Shin,...

work page doi:10.1109/tgrs.2022.3194732 2021
[7]

2023), 98–106

URL:https://ieeexplore.ieee.org/abstract/document/10261879, doi:10.1109/MGRS.2023.3281651. number: 3. Wang, Y., Sun, Y., Cao, X., Wang, Y., Zhang, W., Cheng, X., 2023b. A review of regional and Global scale Land Use/Land Cover (LULC) mapping products generated from satellite remote sensing. ISPRS Journal of Photogrammetry and Remote Sensing 206, 311–334. URL:http...

work page doi:10.1109/mgrs.2023.3281651 2023
[8]

Nature Machine Intelligence 7, 1235–

A semantic-enhanced multi-modal remote sensing foundation model for Earth observation. Nature Machine Intelligence 7, 1235–

work page
[9]

publisher: Nature Publishing Group

URL:https://www.nature.com/articles/s42256-025-01078-8, doi:10.1038/s42256-025-01078-8. publisher: Nature Publishing Group. Xia, J., Yokoya, N., Adriano, B., Broni-Bediako, C., 2023. OpenEarthMap: A Benchmark Dataset for Global High-Resolution Land Cover Mapping, in: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), IEEE, Waikoloa, HI, ...

work page doi:10.1038/s42256-025-01078-8 2023
[10]

1109/WACV56688.2023.00619

URL:https://ieeexplore.ieee.org/document/10030160/, doi:10.1109/WACV56688.2023.00619. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J., 2018. Unified Perceptual Parsing for Scene Understanding, in: ECCV, pp. 418–

work page arXiv 2023
[11]

Bonneville, X

URL:https://openaccess.thecvf.com/content_ECCV_2018/html/Tete_Xiao_Unified_Perceptual_Parsing_ECCV_2018_paper.html. Xiong, Z., Wang, Y., Zhang, F., Stewart, A.J., Hanna, J., Borth, D., Papoutsis, I., Saux, B.L., Camps-Valls, G., Zhu, X.X., 2024. Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation. URL:http://arxiv.org/abs/2403....

work page doi:10.48550/arxiv.2403 2024
[12]

IEEE Geoscience and Remote Sensing Magazine 5, 8–36

Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geoscience and Remote Sensing Magazine 5, 8–36. URL:https://ieeexplore.ieee.org/abstract/document/8113128, doi:10.1109/MGRS.2017.2762307. number: 4. Zhu, X.X., Xiong, Z., Wang, Y., Stewart, A.J., Heidler, K., Wang, Y., Yuan, Z., Dujardin, T., Xu, Q., Shi, Y., 2024. On the Foundations of Earth and Climate...

work page doi:10.1109/mgrs.2017.2762307 2017
