
arxiv: 2606.14081 · v3 · pith:TD7DYQAMnew · submitted 2026-06-12 · 💻 cs.CV · cs.AI · cs.LG · eess.IV

Clay-CNN Hybrids: Leveraging Geospatial Foundation Models as Auxiliary Context for Landslide Detection

Pith reviewed 2026-06-27 05:17 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · eess.IV
keywords landslide detection · geospatial foundation models · U-Net · LoRA · semantic segmentation · Sentinel-2 · disaster mapping · class imbalance

The pith

Hybrid U-Net with Clay GFM context as auxiliary input reaches 64.5% F1 on landslide segmentation, beating both Clay-only and plain U-Net baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates Clay v1.5, a geospatial foundation model, as a way to improve pixel-level landslide segmentation on the highly imbalanced Landslide4Sense benchmark. It tests three setups: Clay as the main encoder with terrain fusion, a U-Net that receives Clay semantic features at the bottleneck, and a standard U-Net. The hybrid U-Net plus two-stage LoRA on Clay yields the highest test F1 of 64.5 percent, compared with 55.2 percent for Clay alone and 59.9 percent for the baseline. Clay underperforms when used standalone because it lacks multi-scale skip connections, yet its pretrained representations reliably help when added as extra context. The results indicate that foundation models deliver the most value for this task when they supplement rather than substitute for convolutional architectures that preserve spatial detail.

Core claim

The hybrid U-Net + Clay model with two-stage Low-Rank Adaptation (LoRA) achieved the best test F1 of 64.5 +/- 1.8% over three seeds, surpassing the Clay-only backbone (55.2 +/- 3.6%) and the U-Net baseline (59.9%). Clay as a standalone encoder underperformed the U-Net due to the absence of multi-scale skip connections, but its pretrained representations consistently improved performance when injected as auxiliary context. These findings suggest that GFMs are most effective for landslide detection when they complement spatially detailed convolutional architectures rather than replace them.
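The size of the reported gaps can be checked with a few lines of arithmetic. The sketch below uses only the means and standard deviations quoted above; expressing the gap in units of the hybrid's standard deviation is an illustrative choice made here, not a significance test from the paper, and no spread is reported for the U-Net baseline.

```python
# Reported test F1 (percent), copied from the core claim above.
hybrid_f1, hybrid_std = 64.5, 1.8   # hybrid U-Net + Clay, three seeds
clay_f1, clay_std = 55.2, 3.6       # Clay-only backbone
unet_f1 = 59.9                      # U-Net baseline (no std reported)

# Absolute gaps in F1 points.
gap_vs_unet = round(hybrid_f1 - unet_f1, 1)
gap_vs_clay = round(hybrid_f1 - clay_f1, 1)

# Gap to the baseline in units of the hybrid's seed-to-seed std.
# This is only a rough scale check, not a statistical test.
gap_in_stds = round(gap_vs_unet / hybrid_std, 2)

print(f"hybrid vs U-Net: +{gap_vs_unet} points ({gap_in_stds} hybrid stds)")
print(f"hybrid vs Clay-only: +{gap_vs_clay} points")
```

With only three seeds and no reported variance for the baseline, a gap of roughly two and a half standard deviations is suggestive rather than conclusive.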

What carries the argument

Two-stage Low-Rank Adaptation (LoRA) that injects Clay semantic context into the U-Net bottleneck as auxiliary input while preserving the convolutional skip connections.
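For readers unfamiliar with LoRA, the following is a minimal pure-Python sketch of the low-rank update itself: a frozen weight W is adapted as W + (alpha / r) * B A, with B initialized to zero so the adapter starts as a no-op. The layer sizes, rank, and alpha below are hypothetical values chosen for illustration. This page does not state which Clay layers are adapted, what rank is used, or what the two stages consist of, so none of that is modeled here.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the effective weight of a LoRA-adapted layer.

    w: frozen pretrained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def lora_param_count(d_out, d_in, r):
    """Trainable adapter parameters versus parameters in the full weight matrix."""
    return r * (d_in + d_out), d_out * d_in

rng = random.Random(0)
d_out, d_in, r = 6, 8, 2  # hypothetical sizes, for illustration only
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
b = [[0.0] * r for _ in range(d_out)]  # zero init: adapted layer equals the frozen one

w_eff = lora_weight(w, a, b, alpha=4.0)
adapter_params, full_params = lora_param_count(d_out, d_in, r)
```

Because only A and B are trained, the adapter adds r * (d_in + d_out) parameters per adapted matrix, which is why LoRA is attractive for adapting a large pretrained encoder inside a smaller segmentation pipeline.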

If this is right

  • GFMs improve landslide segmentation most when supplied as auxiliary context rather than used as the sole encoder.
  • Two-stage LoRA provides an effective way to adapt a pretrained GFM inside a hybrid segmentation pipeline.
  • Convolutional skip connections remain necessary for spatially precise outputs even when strong semantic features are available.
  • Pretrained geospatial representations help mitigate extreme class imbalance in post-event mapping tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hybrid pattern may transfer to other geospatial segmentation problems that share high class imbalance and multi-band imagery.
  • Alternative fusion locations or other GFMs could be tested to see if bottleneck injection is optimal.
  • The performance gap suggests that real-time disaster pipelines could adopt lightweight LoRA adapters on existing CNN backbones without full model replacement.

Load-bearing premise

The measured F1 gains are caused by Clay's semantic context rather than by the specific two-stage LoRA schedule, random seed variation, or unstated differences in training protocol or data augmentation.

What would settle it

Retraining the hybrid architecture with the identical two-stage LoRA schedule but replacing Clay embeddings with random vectors or embeddings from an unrelated model and checking whether the F1 advantage disappears.
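One way such a control could be set up is sketched below: random vectors are drawn to match the per-dimension mean and standard deviation of the real embeddings, so the bottleneck still receives inputs of a similar scale but with no semantic content. The function name, the matching scheme, and the toy embeddings are all assumptions made for illustration; they are not taken from the paper, and the toy values are not Clay outputs.

```python
import random
import statistics

def matched_random_embeddings(embeddings, seed=0):
    """Draw Gaussian vectors matching each dimension's mean and std of `embeddings`.

    embeddings: list of equal-length vectors (one per image chip).
    Returns the same number of vectors with the same dimensionality.
    """
    rng = random.Random(seed)
    dims = list(zip(*embeddings))                # transpose to per-dimension columns
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]  # population std per dimension
    return [[rng.gauss(m, s) for m, s in zip(means, stds)] for _ in embeddings]

# Hypothetical stand-in embeddings: 4 chips, 3 dimensions each.
toy = [[0.1, 2.0, -1.0], [0.3, 2.5, -0.5], [0.2, 1.5, -1.5], [0.4, 2.0, -1.0]]
control = matched_random_embeddings(toy, seed=42)
```

Fixing the seed keeps the control reproducible across reruns, so any remaining F1 advantage can be attributed to the training schedule rather than to the content of the embeddings.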

Figures

Figures reproduced from arXiv: 2606.14081 by Huong Binh Vu.

Figure 1: Geographic distribution of the Landslide4Sense training data across
Figure 2: Exploratory data analysis. (A) Mean spectral signatures of landslide vs. background pixels across the 14 input bands. (B) Per-band separability index
Figure 3: Architecture 1: Clay as primary encoder with multi-scale terrain fusion. (A) The Clay v1.5 ViT encoder design from Kaushik
Figure 4: Architecture 2: Hybrid U-Net + Clay. The U-Net backbone encodes all 14 bands through four downsampling stages with full skip connections, while
Figure 5: Test-set performance of Architecture 2 (Model 7a, representative
Figure 6: Qualitative predictions of Architecture 2 on six test chips ordered by decreasing positive pixel fraction (70.42%–4.99%; representative seed 42).
Figure 7: Signed Grad-CAM fusion analysis for Architecture 2 on five test chips spanning dense to empty scar coverage. Columns: ground truth mask, binary
Original abstract

Rapid post-event landslide mapping is essential for disaster response but remains difficult to automate due to extreme class imbalance. This study evaluates whether Clay v1.5, a Geospatial Foundation Model (GFM), can improve pixel-level landslide segmentation on the Landslide4Sense (L4S) benchmark, which contains 3,799 training chips with 14 Sentinel-2 and terrain bands and approximately 2% positive pixels. We compare three strategies: Clay as the primary encoder with multi-scale residual terrain fusion, a U-Net backbone augmented with Clay semantic context at the bottleneck, and a standard U-Net baseline. The hybrid U-Net + Clay model with two-stage Low-Rank Adaptation (LoRA) achieved the best test F1 of 64.5 +/- 1.8% over three seeds, surpassing the Clay-only backbone (55.2 +/- 3.6%) and the U-Net baseline (59.9%). Clay as a standalone encoder underperformed the U-Net due to the absence of multi-scale skip connections, but its pretrained representations consistently improved performance when injected as auxiliary context. These findings suggest that GFMs are most effective for landslide detection when they complement spatially detailed convolutional architectures rather than replace them.
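The reason F1 rather than accuracy is the headline metric becomes clear on a toy example with roughly the 2% positive rate quoted in the abstract. The sketch below is illustrative only: the 100-pixel mask and both predictions are made up, and `pixel_f1` is a hypothetical helper, not code from the paper.

```python
def pixel_f1(pred, truth):
    """Binary F1 over flat lists of 0/1 pixel labels (positive class = 1)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def accuracy(pred, truth):
    """Fraction of pixels labeled correctly."""
    return sum(1 for p, t in zip(pred, truth) if p == t) / len(truth)

# Toy mask: 100 pixels, 2 of them landslide (2% positive).
truth = [1, 1] + [0] * 98
all_background = [0] * 100         # predicts no landslide anywhere
partial = [1, 0, 1] + [0] * 97     # one hit, one miss, one false alarm

print(accuracy(all_background, truth), pixel_f1(all_background, truth))
print(accuracy(partial, truth), pixel_f1(partial, truth))
```

Both predictions label 98 of 100 pixels correctly, yet only the second one finds any landslide pixels, which is the distinction F1 captures and accuracy hides.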

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript evaluates Clay v1.5, a geospatial foundation model, for pixel-level landslide segmentation on the Landslide4Sense benchmark (3,799 training chips, ~2% positive pixels). It compares three strategies: Clay as primary encoder with multi-scale residual terrain fusion, a standard U-Net baseline, and a hybrid U-Net that injects Clay features at the bottleneck. Two-stage LoRA is reported for the hybrid configuration. The hybrid achieves the highest test F1 of 64.5 ± 1.8% over three seeds, outperforming Clay-only (55.2 ± 3.6%) and U-Net (59.9%). The authors conclude that GFMs improve performance most effectively when used as auxiliary context to CNNs rather than as standalone encoders, owing to the absence of skip connections in the Clay backbone.

Significance. If the reported F1 gains can be attributed to Clay semantic context under controlled conditions, the work would offer concrete empirical guidance on hybrid GFM-CNN designs for class-imbalanced remote-sensing segmentation. The explicit multi-seed reporting with standard deviations and direct baseline comparisons is a positive methodological feature that strengthens reproducibility of the headline numbers.

major comments (2)
  1. [Abstract] Abstract: The central claim that the 4.6-point F1 gain of the hybrid over the U-Net baseline is caused by Clay semantic context requires that the U-Net baseline and Clay-only models received identical training protocols (two-stage LoRA schedule, optimizer, augmentation pipeline, and positive-pixel weighting). No such protocol equivalence is stated or demonstrated, rendering the attribution to Clay features unverifiable from the reported results.
  2. [Results] Results paragraph: No ablation is presented that isolates the contribution of the two-stage LoRA schedule from the Clay feature injection itself. Because the hybrid is the only configuration explicitly described as using two-stage LoRA, the observed delta could arise from the adaptation procedure rather than from the GFM context, directly undermining the conclusion that GFMs are “most effective when they complement” CNNs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for explicit protocol details and ablation clarity. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core findings.

  1. Referee: [Abstract] Abstract: The central claim that the 4.6-point F1 gain of the hybrid over the U-Net baseline is caused by Clay semantic context requires that the U-Net baseline and Clay-only models received identical training protocols (two-stage LoRA schedule, optimizer, augmentation pipeline, and positive-pixel weighting). No such protocol equivalence is stated or demonstrated, rendering the attribution to Clay features unverifiable from the reported results.

    Authors: We agree that protocol equivalence must be stated explicitly for the attribution to be verifiable. All three configurations were trained under the same optimizer, augmentation pipeline, positive-pixel weighting, and epoch schedule; two-stage LoRA was applied to both the Clay-only and hybrid models (as required for GFM adaptation), while the U-Net baseline used standard fine-tuning. We will revise the manuscript to add an explicit methods paragraph and comparison table confirming these shared settings across models. revision: yes

  2. Referee: [Results] Results paragraph: No ablation is presented that isolates the contribution of the two-stage LoRA schedule from the Clay feature injection itself. Because the hybrid is the only configuration explicitly described as using two-stage LoRA, the observed delta could arise from the adaptation procedure rather than from the GFM context, directly undermining the conclusion that GFMs are “most effective when they complement” CNNs.

    Authors: The referee correctly notes that the current text only highlights two-stage LoRA for the hybrid, leaving open the possibility that LoRA itself drives part of the gain. Clay-only also uses LoRA for adaptation, and the hybrid still outperforms it, but we acknowledge the lack of a pure U-Net + LoRA control. We will revise the results section to clarify LoRA usage per model and add a short ablation (U-Net with two-stage LoRA, no Clay) if space allows; otherwise we will note this as a limitation and qualify the conclusion accordingly. revision: partial

Circularity Check

0 steps flagged

No circularity; purely empirical benchmark comparison


The manuscript reports F1 scores from training three segmentation models (hybrid U-Net+Clay, Clay-only, U-Net baseline) on the fixed Landslide4Sense dataset and evaluating on a held-out test set. No equations, derivations, or first-principles results appear. The central claim rests on direct experimental deltas against explicitly described baselines rather than any reduction of outputs to fitted inputs or self-citations. No self-definitional, fitted-prediction, or uniqueness-theorem patterns are present. This is standard empirical ML evaluation and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The performance claim rests on the public Landslide4Sense benchmark being representative and on standard supervised segmentation training assumptions; no new entities or ad-hoc constants are introduced beyond routine ML hyperparameters.

free parameters (1)
  • LoRA rank and learning rate schedule
    Two-stage LoRA adaptation introduces rank and learning-rate choices that are fitted during training; exact values not stated in abstract.
axioms (1)
  • domain assumption The Landslide4Sense dataset split and annotation protocol constitute a fair test of generalization for post-event landslide mapping.
    All reported numbers are computed on this single benchmark.

pith-pipeline@v0.9.1-grok · 5754 in / 1269 out tokens · 22746 ms · 2026-06-27T05:17:42.702724+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 12 canonical work pages · 3 internal anchors

  1. [1]

    Understanding fatal landslides at global scales: A summary of topographic, climatic, and anthropogenic perspectives,

    S. Fidan et al., “Understanding fatal landslides at global scales: A summary of topographic, climatic, and anthropogenic perspectives,” Nat. Hazards, vol. 120, pp. 6437–6455, 2024

  2. [2]

    Quantitative risk analysis for earthquake-induced landslides—Emamzadeh Ali, Iran,

    S. M. Mousavi et al., “Quantitative risk analysis for earthquake-induced landslides—Emamzadeh Ali, Iran,” Eng. Geol., vol. 122, no. 3–4, pp. 191–203, 2011

  3. [3]

    Landslide susceptibility in cemented volcanic soils, Ask region, Iran,

    S. M. Mousavi, “Landslide susceptibility in cemented volcanic soils, Ask region, Iran,” Indian Geotech. J., vol. 47, no. 1, pp. 115–130, 2017

  4. [4]

    Landslides in a changing world,

    I. Alcántara-Ayala, “Landslides in a changing world,” Landslides, vol. 22, pp. 2851–2865, 2025

  5. [5]

    Climate change could trigger more landslides in high mountain Asia,

    “Climate change could trigger more landslides in high mountain Asia,” NOAA Research, Feb. 11,

  6. [6]

    Available: https://research.noaa.gov/climate-change-could-trigger-more-landslides-in-high-mountain-asia/

    [Online]. Available: https://research.noaa.gov/climate-change-could-trigger-more-landslides-in-high-mountain-asia/

  7. [7]

    Rapid landslide detection from free optical satellite imagery using a robust change detection technique,

    R. Coluzzi et al., “Rapid landslide detection from free optical satellite imagery using a robust change detection technique,” Sci. Rep., vol. 15, Art. no. 4697, 2025

  8. [8]

    Landslide4Sense: Reference benchmark data and deep learning models for landslide detection,

    O. Ghorbanzadeh, Y. Xu, P. Ghamisi, M. Kopp, and D. Kreil, “Landslide4Sense: Reference benchmark data and deep learning models for landslide detection,” IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–17, 2022.

  9. [9]

    Enhancing landslide detection in Western Ghats of Kerala, India with deep learning and explainable AI,

    A. Sreekumar et al., “Enhancing landslide detection in Western Ghats of Kerala, India with deep learning and explainable AI,” Sci. Rep., vol. 15, Art. no. 45151, 2025, doi: 10.1038/s41598-025-33065-9

  10. [10]

    Brief communication: AI-driven rapid landslide mapping following the 2024 Hualien earthquake in Taiwan,

    L. Nava et al., “Brief communication: AI-driven rapid landslide mapping following the 2024 Hualien earthquake in Taiwan,” Nat. Hazards Earth Syst. Sci., vol. 25, pp. 2371–2377, 2025, doi: 10.5194/nhess-25-2371-2025

  11. [11]

    A feature fusion method on landslide identification in remote sensing with Segment Anything Model,

    C. Yang et al., “A feature fusion method on landslide identification in remote sensing with Segment Anything Model,” Landslides, vol. 22, pp. 471–483, 2025, doi: 10.1007/s10346-024-02390-x

  12. [12]

    Segment Anything,

    A. Kirillov et al., “Segment Anything,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023

  13. [13]

    Prithvi-EO-2.0: A versatile multi-temporal foundation model for Earth observation applications,

    D. Szwarcman et al., “Prithvi-EO-2.0: A versatile multi-temporal foundation model for Earth observation applications,” arXiv preprint arXiv:2412.02732, Mar. 2026

  14. [14]

    Clay foundation model,

    Clay Foundation, “Clay foundation model,” 2024. [Online]. Available: https://clay-foundation.github.io/model/

  15. [15]

    Assessing the value of geo-foundational models for flood inundation mapping: Benchmarking models for Sentinel-1, Sentinel-2, and PlanetScope for end-users,

    S. Kaushik et al., “Assessing the value of geo-foundational models for flood inundation mapping: Benchmarking models for Sentinel-1, Sentinel-2, and PlanetScope for end-users,” arXiv preprint arXiv:2511.01990, Jan. 2026

  16. [16]

    Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,

    Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in Proc. 33rd Int. Conf. Mach. Learn. (ICML), 2016, pp. 1050–1059

  17. [17]

    Grad-CAM: Visual explanations from deep networks via gradient-based localization,

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” Int. J. Comput. Vis., vol. 128, no. 2, pp. 336–359, 2020, doi: 10.1109/ICCV.2017.74

  18. [18]

    Landslide4Sense dataset (v1.0),

    IBM NASA Geospatial, “Landslide4Sense dataset (v1.0),” Hugging Face, 2024. [Online]. Available: https://huggingface.co/datasets/harshinde/LandSlide4Sense

  19. [19]

    The Outcome of the 2022 Landslide4Sense Competition: Advanced Landslide Detection from Multi-Source Satellite Imagery,

    O. Ghorbanzadeh et al., “The Outcome of the 2022 Landslide4Sense Competition: Advanced Landslide Detection from Multi-Source Satellite Imagery,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 15, pp. 9927–9942, 2022, doi: 10.1109/JSTARS.2022.3220845

  20. [20]

    Landslide4Sense-2022: Data description and baseline code for LandSlide4Sense 2022 competition,

    Institute of Advanced Research in Artificial Intelligence (IARAI), “Landslide4Sense-2022: Data description and baseline code for LandSlide4Sense 2022 competition,” GitHub, 2022. [Online]. Available: https://github.com/iarai/Landslide4Sense-2022

  21. [21]

    Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,

    K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2015, pp. 1026–1034

  22. [22]

    The Lovász Hinge: A Novel Convex Surrogate for Submodular Losses

    J. Yu and M. Blaschko, “The Lovász hinge: A novel convex surrogate for submodular losses,” arXiv preprint arXiv:1512.07797, 2017

  23. [23]

    Two-stage training strategy combined with neural network for segmentation of internal mammary artery graft,

    S. Sun et al., “Two-stage training strategy combined with neural network for segmentation of internal mammary artery graft,” Biomed. Signal Process. Control, vol. 80, Art. no. 104278, Feb. 2023

  24. [24]

    C. J. Van Rijsbergen, Information Retrieval, 2nd ed. London, U.K.: Butterworths, 1979

  25. [25]

    Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration,

    M. Kull, M. Perelló-Nieto, M. Kängsepp, T. Silva Filho, H. Song, and P. Flach, “Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration,” in Adv. Neural Inf. Process. Syst., vol. 32, 2019

  26. [26]

    CTFNet: CNN-Transformer fusion network for remote-sensing image semantic segmentation,

    H. Wu, P. Huang, M. Zhang, and W. Tang, “CTFNet: CNN-Transformer fusion network for remote-sensing image semantic segmentation,” IEEE Geosci. Remote Sens. Lett., vol. 21, Art. no. 5000305, pp. 1–5, 2024, doi: 10.1109/LGRS.2023.3336061

  27. [27]

    DGCFNet: Dual Global Context Fusion Network for remote sensing image semantic segmentation,

    Y. Liao, T. Zhou, L. Li, J. Li, J. Shen, and A. Hamdulla, “DGCFNet: Dual Global Context Fusion Network for remote sensing image semantic segmentation,” PeerJ Comput. Sci., vol. 11, Art. no. e2786, 2025

  28. [28]

    Landslide segmentation with deep learning: Evaluating model generalization in rainfall-induced landslides in Brazil,

    L. P. Soares, H. C. Dias, G. P. B. Garcia, and C. H. Grohmann, “Landslide segmentation with deep learning: Evaluating model generalization in rainfall-induced landslides in Brazil,” Remote Sens., vol. 16, no. 22, Art. no. 4344, 2024

  29. [29]

    Z. Ren et al., “Enhancing deep learning-based landslide detection from open satellite imagery via multisource data fusion of spectral, textural, and topographical features: A case study of old landslide detection in the Three Gorges Reservoir Area (TGRA),” Geomat. Nat. Hazards Risk, vol. 16, no. 1, Art. no. 2421224, 2025

  30. [30]

    Semi-automatic mapping of shallow landslides using free Sentinel-2 images and Google Earth Engine,

    D. Notti et al., “Semi-automatic mapping of shallow landslides using free Sentinel-2 images and Google Earth Engine,” Nat. Hazards Earth Syst. Sci., vol. 23, no. 7, pp. 2625–2648, 2023, doi: 10.5194/nhess-23-2625-2023

  31. [31]

    Landslide Detection and Mapping Using Deep Learning Across Multi-Source Satellite Data and Geographic Regions

    R. A. Burange, H. K. Shinde, and O. Mutyalwar, “Landslide detection and mapping using deep learning across multi-source satellite data and geographic regions,” arXiv preprint arXiv:2507.01123, 2025

  32. [32]

    Multi-Modal Landslide Detection from Sentinel-1 SAR and Sentinel-2 Optical Imagery Using Multi-Encoder Vision Transformers and Ensemble Learning

    I. Nasios, “Multi-modal landslide detection from Sentinel-1 SAR and Sentinel-2 optical imagery using multi-encoder vision transformers and ensemble learning,” arXiv preprint arXiv:2604.05959, Apr. 2026.

Binh Huong Vu received a B.A. in Economics with a minor in Computer Science from Harvard University. Her research interests lie at the intersection of machi...