
arxiv: 2604.14582 · v1 · submitted 2026-04-16 · 💻 cs.CV

MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models

Pith reviewed 2026-05-10 11:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords land cover mapping · map super-resolution · vision foundation models · weakly supervised learning · prompt-based inference · cosine similarity · graph propagation

The pith

MapSR decouples supervision from training to generate high-resolution land cover maps from low-resolution labels using frozen vision foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MapSR as a prompt-driven approach to map super-resolution that turns coarse low-resolution land cover products into detailed high-resolution versions at the scale of the input imagery. It does so by using the low-resolution labels once to build class prompts from features of a frozen vision foundation model, then switches to training-free inference. This yields competitive accuracy on the Chesapeake Bay dataset while slashing trainable parameters by four orders of magnitude and cutting training time from hours to minutes. A sympathetic reader would care because dense high-resolution annotation is often prohibitively expensive, and the method makes scalable, label-efficient mapping practical under tight budgets.

Core claim

MapSR demonstrates that high-resolution land cover mapping can proceed without any high-resolution labels. A lightweight linear probe trained on low-resolution labels first identifies high-confidence features in a frozen vision foundation model, and those features are aggregated into class prompts. Predictions then come from cosine-similarity matching against the prompts, followed by graph-based propagation for spatial refinement. The pipeline reaches 59.64 percent mean intersection-over-union on the Chesapeake Bay dataset.

What carries the argument

Class prompts aggregated from the high-confidence frozen-model features that a linear probe identifies, followed by cosine-similarity matching and graph-based propagation for refinement.
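
As a minimal sketch of that machinery, assuming features and probe logits are already available as plain Python lists: the snippet below averages the L2-normalized features whose top probe probability clears a threshold into one prompt per class, then labels a new feature by its most cosine-similar prompt. The function names, the softmax-based confidence, and the threshold `tau=0.9` are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_class_prompts(features, probe_logits, num_classes, tau=0.9):
    """Average the normalized features whose top probe probability is >= tau.

    `tau` is a hypothetical confidence threshold. A class with no confident
    feature gets None instead of a prompt.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feat, logits in zip(features, probe_logits):
        probs = softmax(logits)
        cls = max(range(len(probs)), key=probs.__getitem__)
        if probs[cls] >= tau:
            unit = l2_normalize(feat)
            for d in range(dim):
                sums[cls][d] += unit[d]
            counts[cls] += 1
    return [l2_normalize(s) if n > 0 else None for s, n in zip(sums, counts)]


def cosine_match(feature, prompts):
    """Return (best class index, per-class cosine scores) for one feature."""
    unit = l2_normalize(feature)
    scores = [
        sum(a * b for a, b in zip(unit, p)) if p is not None else float("-inf")
        for p in prompts
    ]
    return max(range(len(prompts)), key=scores.__getitem__), scores
```

Because the prompts and the query are both unit vectors, the dot product in `cosine_match` is exactly the cosine similarity, so no trainable parameters are involved at inference time.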

If this is right

  • High-resolution land cover mapping becomes feasible with only low-resolution labels and minimal compute.
  • Trainable parameters drop by four orders of magnitude compared with retraining dense predictors.
  • Training completes in minutes rather than hours.
  • Performance remains competitive with the strongest weakly supervised baseline and exceeds one fully supervised baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prompt-extraction pattern could extend to other dense-prediction tasks in remote sensing that currently demand heavy supervision.
  • Because the heavy model stays frozen, the approach may scale to larger geographic areas or newer foundation models without retraining costs.
  • Graph propagation for refinement hints that incorporating simple spatial priors can compensate for the lack of direct high-resolution supervision.

Load-bearing premise

High-confidence features selected by the linear probe on low-resolution labels will reliably match the correct land-cover classes in the corresponding high-resolution imagery.

What would settle it

A test on a new region or sensor where the foundation-model features fail to separate land-cover classes cleanly, or where low-resolution labels contain substantial noise, and the resulting mean intersection-over-union falls well below the reported weakly supervised baselines.
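
For readers who want to run such a test, the metric itself is simple to compute. The sketch below is one common way to calculate mean intersection-over-union from flat lists of true and predicted class indices; skipping classes that appear in neither list is an assumption here, and the paper's exact evaluation protocol may differ.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over the classes present in the labels or the predictions."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            # A correct pixel adds to both intersection and union of its class.
            inter[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[t] += 1
            union[p] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    if not ious:
        raise ValueError("no class is present in y_true or y_pred")
    return sum(ious) / len(ious)
```

A drop in this number on a new region, relative to the weakly supervised baselines evaluated the same way, is the kind of evidence the test above calls for.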

Figures

Figures reproduced from arXiv: 2604.14582 by Hanlin Wu, Jie Ma, Qi Yu, Ruiqi Wang.

Figure 1: Comparison of representative methods in terms of annotation cost
Figure 2: Overview of MapSR. (a) Feature extraction and upsampling produce dense HR features from the input image using a frozen vision foundation model.
Figure 3: Visual comparison for HR land-cover mapping. (a) HR image; (b) HR
Original abstract

High-resolution (HR) land-cover mapping is often constrained by the high cost of dense HR annotations. We revisit this problem from the perspective of map super-resolution, which enhances coarse low-resolution (LR) land-cover products into HR maps at the resolution of the input imagery. Existing weakly supervised methods can leverage LR labels, but they typically use them to retrain dense predictors with substantial computational cost. We propose MapSR, a prompt-driven framework that decouples supervision from model training. MapSR uses LR labels once to extract class prompts from frozen vision foundation model features through a lightweight linear probe, after which HR mapping proceeds via training-free metric inference and graph-based prediction refinement. Specifically, class prompts are estimated by aggregating high-confidence HR features identified by the linear probe, and HR predictions are obtained by cosine-similarity matching followed by graph-based propagation for spatial refinement. Experiments on the Chesapeake Bay dataset show that MapSR achieves 59.64% mIoU without any HR labels, remaining competitive with the strongest weakly supervised baseline and surpassing a fully supervised baseline. Notably, MapSR reduces trainable parameters by four orders of magnitude and shortens training time from hours to minutes, enabling scalable HR mapping under limited annotation and compute budgets. The code is available at https://github.com/rikirikirikiriki/MapSR.
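
The abstract does not spell out the form of the graph-based refinement. As a hedged illustration only, the sketch below uses the classic label-propagation iteration F ← αSF + (1 − α)Y, where S is the symmetrically normalized affinity matrix and Y holds the initial per-node class scores (for example, cosine similarities to the class prompts). The dense affinity matrix, `alpha=0.8`, and the iteration count are assumptions chosen for a small runnable example, not the paper's settings.

```python
import math


def propagate(scores, weights, alpha=0.8, iters=50):
    """Smooth per-node class scores over a weighted graph.

    scores:  n x k list of initial class scores (Y).
    weights: n x n symmetric, non-negative affinity matrix (W), zero diagonal.
    Iterates F <- alpha * S @ F + (1 - alpha) * Y with S = D^-1/2 W D^-1/2.
    """
    n, k = len(scores), len(scores[0])
    degree = [sum(row) for row in weights]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    S = [
        [inv_sqrt[i] * weights[i][j] * inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]
    F = [row[:] for row in scores]
    for _ in range(iters):
        F = [
            [
                alpha * sum(S[i][j] * F[j][c] for j in range(n))
                + (1 - alpha) * scores[i][c]
                for c in range(k)
            ]
            for i in range(n)
        ]
    return F
```

On a three-node chain where the two end nodes confidently vote for class 0 and the middle node weakly prefers class 1, this iteration pulls the middle node toward class 0, which is the kind of spatial consistency the refinement step is meant to provide.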

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MapSR, a prompt-driven framework for land-cover map super-resolution that uses low-resolution (LR) labels only once to train a lightweight linear probe on frozen vision foundation model (VFM) features. High-confidence HR features are aggregated into class prompts, after which HR predictions are generated via training-free cosine-similarity matching and graph-based spatial refinement. On the Chesapeake Bay dataset, MapSR reports 59.64% mIoU without any HR labels, remaining competitive with the strongest weakly-supervised baseline while surpassing a fully-supervised baseline, with trainable parameters reduced by four orders of magnitude and training time shortened from hours to minutes.

Significance. If the central empirical result holds, the work provides a practical, annotation-efficient route to high-resolution land-cover mapping that decouples supervision from dense model training. The training-free inference stage after a single linear probe, combined with the reported efficiency gains, could enable scalable mapping over large regions under limited compute and label budgets by leveraging existing VFMs.

major comments (2)
  1. §3.2–3.3 (Method): The high-confidence HR feature selection step relies on a linear probe trained solely on coarse LR labels. No ablation isolates this selection mechanism, and no direct validation of probe precision at HR resolution is provided (impossible without HR labels). If the probe assigns high confidence to mixed-class or misregistered features, the resulting prompts become noisy and the subsequent cosine-similarity + graph propagation cannot be guaranteed to recover accurate spatial predictions, undermining attribution of the 59.64% mIoU to the proposed method.
  2. §4 (Experiments): The claim that MapSR surpasses a fully-supervised baseline requires explicit details on the baseline implementation, including architecture, whether it uses the same VFM backbone, training data splits, and optimization settings. Without these, it is unclear whether the outperformance reflects a genuine advantage or differences in experimental protocol.
minor comments (2)
  1. Abstract and §4: The Chesapeake Bay dataset details (number of classes, exact train/test split, and LR label source) should be stated more explicitly to allow direct reproduction of the reported mIoU.
  2. Figure captions and §3.4: Clarify the graph-propagation hyperparameters and how they were chosen; a brief sensitivity analysis would strengthen the reproducibility claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below. We will revise the manuscript to incorporate additional ablations and implementation details where feasible.

Point-by-point responses
  1. Referee: §3.2–3.3 (Method): The high-confidence HR feature selection step relies on a linear probe trained solely on coarse LR labels. No ablation isolates this selection mechanism, and no direct validation of probe precision at HR resolution is provided (impossible without HR labels). If the probe assigns high confidence to mixed-class or misregistered features, the resulting prompts become noisy and the subsequent cosine-similarity + graph propagation cannot be guaranteed to recover accurate spatial predictions, undermining attribution of the 59.64% mIoU to the proposed method.

    Authors: We agree that direct validation of the linear probe's precision at HR resolution is impossible without HR labels, as this is inherent to the weakly-supervised setting. To isolate the high-confidence selection mechanism, we will add an ablation study in the revised manuscript comparing the full pipeline against variants without high-confidence filtering (e.g., using all features or random selection) and across different confidence thresholds. Regarding potential noise from mixed-class or misregistered features, the graph-based refinement is intended to enforce spatial consistency and mitigate local errors. We will clarify this design rationale and include the new ablation results to better attribute the reported mIoU.

    revision: partial

  2. Referee: §4 (Experiments): The claim that MapSR surpasses a fully-supervised baseline requires explicit details on the baseline implementation, including architecture, whether it uses the same VFM backbone, training data splits, and optimization settings. Without these, it is unclear whether the outperformance reflects a genuine advantage or differences in experimental protocol.

    Authors: We apologize for the insufficient detail on the fully-supervised baseline. In the revised manuscript we will add explicit implementation information: the baseline uses the identical VFM backbone with a trainable linear head, the same data splits as the weakly-supervised experiments, cross-entropy loss, Adam optimizer at learning rate 1e-3, and 50 training epochs. These details will be presented in a new table or subsection to ensure the comparison is transparent and reproducible.

    revision: yes
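
The ablation promised in the first response (all features, random selection, and confidence thresholding) can be pictured with a small selection helper. This is a hypothetical sketch: the function name, the mode names, and the choice to give the random variant the same budget as the thresholded one are assumptions made for illustration, not the authors' protocol.

```python
import random


def select_features(confidences, mode="threshold", tau=0.9, seed=0):
    """Return the indices of features used to build class prompts.

    mode="all":       keep every feature (no filtering).
    mode="threshold": keep features with confidence >= tau.
    mode="random":    keep as many features as "threshold" would,
                      drawn uniformly at random without replacement.
    """
    n = len(confidences)
    if mode == "all":
        return list(range(n))
    kept = [i for i, c in enumerate(confidences) if c >= tau]
    if mode == "threshold":
        return kept
    if mode == "random":
        rng = random.Random(seed)  # fixed seed keeps the ablation repeatable
        return sorted(rng.sample(range(n), len(kept)))
    raise ValueError(f"unknown mode: {mode!r}")
```

Matching the random variant's budget to the thresholded one separates the effect of which features are chosen from the effect of how many are chosen.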

standing simulated objections not resolved
  • Direct validation of linear probe precision at HR resolution, which requires unavailable HR labels

Circularity Check

0 steps flagged

No circularity: MapSR derivation is empirical and self-contained

full rationale

The paper presents an empirical pipeline: a single linear probe is fit to LR labels solely to select high-confidence HR VFM features, which are then aggregated into prompts; HR maps are produced by cosine-similarity matching plus graph propagation. No equation, claim, or result reduces the reported 59.64% mIoU (or any performance number) to a quantity defined by the probe weights, to a self-referential definition, or to a self-citation chain. The method is explicitly training-free after the probe step and relies on external frozen VFM features, so the performance numbers are independent empirical measurements rather than tautological outputs of the inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the transferability of frozen VFM features to land-cover classes and the sufficiency of metric matching plus graph smoothing; no new physical entities are postulated and free parameters are limited to the lightweight probe.

free parameters (1)
  • linear probe weights
    Fitted once on LR labels to identify class prompts; the only learned component after the frozen VFM.
axioms (1)
  • domain assumption: Features from the frozen vision foundation model are semantically aligned with land-cover classes when aggregated via the linear probe
    Invoked when high-confidence HR features are selected to form class prompts.


