Multi-Modal Landslide Detection from Sentinel-1 SAR and Sentinel-2 Optical Imagery Using Multi-Encoder Vision Transformers and Ensemble Learning

Ioannis Nasios

arxiv: 2604.05959 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.LG

Multi-Modal Landslide Detection from Sentinel-1 SAR and Sentinel-2 Optical Imagery Using Multi-Encoder Vision Transformers and Ensemble Learning

Ioannis Nasios This is my paper

Pith reviewed 2026-05-10 18:28 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords landslide detectionSentinel-1Sentinel-2vision transformersensemble learningmulti-modal fusionSARoptical imagery

0 comments

The pith

Fusing Sentinel-1 SAR radar data with Sentinel-2 optical imagery through separate vision transformer encoders and an ensemble of neural networks with gradient boosting models produces an F1 score of 0.919 for detecting landslides in image 0

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that processes Sentinel-1 radar and Sentinel-2 optical satellite images using separate vision transformer encoders for each type of data before combining their outputs. This multi-encoder approach, paired with an ensemble of neural networks and gradient boosting models like LightGBM and XGBoost, allows accurate identification of landslides in image patches. Adding spectral indices such as NDVI further helps detect changes in vegetation and surface features. The result is a system that works even without prior images for comparison, making it practical for quick response in disaster situations where historical data might be missing or inconsistent.

Core claim

The authors claim that by using dedicated encoders for SAR and optical modalities within a vision transformer setup and then applying ensemble learning, their method can fuse the strengths of both data types to achieve an F1 score of 0.919 in classifying patches as containing landslides or not, all without the need for pre-event optical imagery that is typically used in change detection.

What carries the argument

Multi-encoder vision transformers that route each data modality through its own lightweight pretrained encoder, combined with ensemble integration of neural networks and gradient boosting classifiers along with derived spectral indices.

Load-bearing premise

The complementary strengths of SAR and optical data can be effectively combined through separate encoders and ensembles to detect landslides accurately even without pre-event images for comparison.

What would settle it

Testing the full multi-modal model against single-modality versions on an independent landslide dataset from a new region and finding no meaningful F1 improvement from the fusion would indicate that the claimed benefit of combining the data types does not hold.

Figures

Figures reproduced from arXiv: 2604.05959 by Ioannis Nasios.

**Figure 1.** Figure 1: (a) Sentinel-2 RGB image, (b) Sentinel-1 pseudo-RGB SAR composite image 2.1.2. Feature Engineering In addition to the 12 channels provided in the original dataset, six supplementary bands (bands 12-17) were created to expand the feature space and support some of the models in [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗

**Figure 2.** Figure 2: Flowchart were employed to enhance model generalization, while class imbalance was again handled with a scale_pos_weight of 1.5. Additional regularization was provided by L2 (reg_lambda = 0.3) to prevent overfitting. Both GBMs followed very similar design philosophies and achieved comparable performance on the datasets ( [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

**Figure 3.** Figure 3: Model with 3 input modalities 2.3.4. Calibration of Ensemble Outputs The final classification outputs were derived by averaging the probability estimates produced by all models included in the ensemble. OOF predictions were subsequently employed to guide the calibration process, enabling both refinement of the ensembling strategy and determination of the optimal decision threshold. A threshold value of 0.4… view at source ↗

**Figure 4.** Figure 4: OOF score per Threshold [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗

**Figure 5.** Figure 5: Feature Importance et al. (2024) explored the detection of old landslides in the Three Gorges Reservoir Area through a deep learning framework that incorporated Sentinel-2 imagery and topographical features, achieving an F1-score of 0.9105. While these studies were conducted on different datasets, the methodology presented in this work demonstrates superior performance, achieving an overall F1-score of 0.9… view at source ↗

read the original abstract

Landslides represent a major geohazard with severe impacts on human life, infrastructure, and ecosystems, underscoring the need for accurate and timely detection approaches to support disaster risk reduction. This study proposes a modular, multi-model framework that fuses Sentinel-2 optical imagery with Sentinel-1 Synthetic Aperture Radar (SAR) data, for robust landslide detection. The methodology leverages multi-encoder vision transformers, where each data modality is processed through separate lightweight pretrained encoders, achieving strong performance in landslide detection. In addition, the integration of multiple models, particularly the combination of neural networks and gradient boosting models (LightGBM and XGBoost), demonstrates the power of ensemble learning to further enhance accuracy and robustness. Derived spectral indices, such as NDVI, are integrated alongside original bands to enhance sensitivity to vegetation and surface changes. The proposed methodology achieves a state-of-the-art F1 score of 0.919 on landslide detection, addressing a patch-based classification task rather than pixel-level segmentation and operating without pre-event Sentinel-2 data, highlighting its effectiveness in a non-classical change detection setting. It also demonstrated top performance in a machine learning competition, achieving a strong balance between precision and recall and highlighting the advantages of explicitly leveraging the complementary strengths of optical and radar data. The conducted experiments and research also emphasize scalability and operational applicability, enabling flexible configurations with optical-only, SAR-only, or combined inputs, and offering a transferable framework for broader natural hazard monitoring and environmental change applications. Full training and inference code can be found in https://github.com/IoannisNasios/sentinel-landslide-cls.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives a practical, code-ready multi-modal pipeline for patch-based landslide detection from free Sentinel SAR and optical data that hits 0.919 F1 on a competition set.

read the letter

The core contribution is a modular setup that feeds SAR and optical patches through separate lightweight pretrained ViT encoders, adds NDVI and similar indices, then ensembles the neural outputs with LightGBM and XGBoost. It works without pre-event imagery and treats the problem as patch classification rather than pixel segmentation, which matches many operational monitoring needs where you just need to flag areas quickly and under clouds. The public GitHub code and the reported competition win are real pluses; the flexibility to run optical-only, SAR-only, or combined is useful for deployment. The F1 score looks strong on the given data and the approach is a straightforward extension of existing multi-modal remote-sensing work rather than a big conceptual leap. The main gaps are in the validation details. The abstract and stress-test note do not show dataset size, train/test split strategy, cross-validation, error bars, or ablations that quantify how much the ensemble or the SAR branch actually adds. Without those, it is difficult to judge whether the 0.919 will hold on new regions or different sensors. Generalization beyond the competition data remains the standard open question for this kind of empirical remote-sensing paper. This work is aimed at practitioners who need a ready pipeline for landslide or similar hazard mapping using public Sentinel data. A reader building operational tools or running quick experiments will find it worth looking at; someone hunting for novel fusion theory will not. I would send it to peer review because the task is relevant, the code is public, and the empirical claim is concrete enough for referees to evaluate the methods and request the missing validation steps.

Referee Report

2 major / 3 minor

Summary. The manuscript describes a multi-modal approach for detecting landslides using Sentinel-1 SAR and Sentinel-2 optical imagery. It utilizes separate lightweight pretrained vision transformer encoders for each modality, incorporates spectral indices like NDVI, and employs an ensemble learning strategy combining neural networks with LightGBM and XGBoost. The method achieves an F1 score of 0.919 on a patch-based classification task without pre-event Sentinel-2 data and performed well in a machine learning competition. Public code is provided.

Significance. This work has potential significance in providing a flexible and effective framework for landslide detection that leverages complementary information from SAR and optical sensors without requiring pre-event imagery. The modular design allows for different input configurations, which could be useful for operational disaster response and broader environmental monitoring applications. The public availability of the code supports reproducibility.

major comments (2)

[Results] Results section: The reported F1 score of 0.919 is presented without accompanying details on the dataset size, validation strategy, or variability across runs, which is necessary to substantiate the state-of-the-art claim and assess generalizability beyond the competition dataset.
[Methods] Methods section: While the use of multi-encoder ViTs is described, there is insufficient information on the exact fusion mechanism between the SAR and optical feature representations prior to the ensemble step, which is central to demonstrating the benefit of multi-modal integration.

minor comments (3)

[Abstract] The abstract mentions 'lightweight pretrained encoders' but does not specify which ViT variants or pretraining datasets are used; this should be clarified in the methods.
No discussion of potential limitations, such as sensitivity to SAR speckle noise or optical cloud cover, is included.
[Introduction] The competition name or reference should be included for better context on the performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. We have addressed the two major comments point by point below and will incorporate the requested clarifications and details into the revised manuscript.

read point-by-point responses

Referee: [Results] Results section: The reported F1 score of 0.919 is presented without accompanying details on the dataset size, validation strategy, or variability across runs, which is necessary to substantiate the state-of-the-art claim and assess generalizability beyond the competition dataset.

Authors: We agree that additional details would strengthen the results section and help substantiate the performance claims. The current manuscript references the competition dataset and the patch-based task but does not provide explicit statistics. In the revision we will add the total number of patches, class distribution, and the precise train/validation/test split ratios. We will also describe the validation strategy (hold-out from the competition data) and include performance variability by reporting mean F1 and standard deviation across multiple independent runs with different random seeds. These additions will appear in a new or expanded paragraph in the results section. revision: yes
Referee: [Methods] Methods section: While the use of multi-encoder ViTs is described, there is insufficient information on the exact fusion mechanism between the SAR and optical feature representations prior to the ensemble step, which is central to demonstrating the benefit of multi-modal integration.

Authors: We acknowledge that the fusion step between the modality-specific encoders could be described more explicitly. The manuscript outlines separate lightweight ViT encoders for SAR and optical inputs, but the precise manner in which their feature representations are combined before being passed to the ensemble (NN, LightGBM, XGBoost) is not detailed enough. In the revised methods section we will specify the fusion operation (including feature dimensions, concatenation or projection steps), how the fused representation is prepared for each ensemble member, and any ablation or analysis that isolates the contribution of the multi-modal fusion. A clarifying diagram or pseudocode will be added if space permits. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical performance claim

full rationale

The paper presents a modular multi-encoder ViT + ensemble (LightGBM/XGBoost) pipeline for patch-based landslide classification on Sentinel-1/2 data, reporting an F1 of 0.919 on a competition dataset without pre-event imagery. All load-bearing steps are standard supervised training and fusion of modalities; no equations, uniqueness theorems, or predictions reduce by construction to fitted inputs or self-citations. Public code is supplied, making the result externally verifiable on the given split. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Because only the abstract is available, the ledger is necessarily incomplete. The central claim rests on standard assumptions of supervised learning (i.i.d. train/test splits, appropriate loss functions) and on the availability of the competition dataset. No new physical axioms or invented entities are introduced.

pith-pipeline@v0.9.0 · 5593 in / 1166 out tokens · 45420 ms · 2026-05-10T18:28:46.536381+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

multi-encoder vision transformers... ensemble of neural networks and gradient boosting models (LightGBM and XGBoost)... F1 score of 0.919 on landslide detection, addressing a patch-based classification task
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Derived spectral indices, such as NDVI... statistical descriptors... 126 features per patch

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Clay-CNN Hybrids: Leveraging Geospatial Foundation Models as Auxiliary Context for Landslide Detection
cs.CV 2026-06 unverdicted novelty 5.0

Hybrid U-Net augmented with Clay GFM context via two-stage LoRA reaches 64.5% test F1 on Landslide4Sense, beating both standalone Clay (55.2%) and plain U-Net (59.9%).

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A lightweight dual-stream attention network for real-time landslide monitor- ing in multi-modal remote sensing imagery. Remote Sensing Applications: Society and Environment , 101732. Di Martire, D., Tessitore, S., Brancato, D., Ciminelli, M.G., Costabile, S., Costantini, M., Graziano, G.V ., Minati, F., Ramondini, M., Calcaterra, D., 2016. Landslide detec...

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Wightman, Pytorch image models, https://github.com/rwightman/pytorch-image-models, 2019

An integrated approach of machine learning and remote sensing for evaluating landslide hazards and risk hotspots, nw himalaya. Remote Sensing Applications: Society and Environment 33, 101140. Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y ., 2022. Maxvit: Multi-axis vision transformer, in: European conference on computer vi- sion,...

work page doi:10.5281/zenodo.4414861 2022

[1] [1]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A lightweight dual-stream attention network for real-time landslide monitor- ing in multi-modal remote sensing imagery. Remote Sensing Applications: Society and Environment , 101732. Di Martire, D., Tessitore, S., Brancato, D., Ciminelli, M.G., Costabile, S., Costantini, M., Graziano, G.V ., Minati, F., Ramondini, M., Calcaterra, D., 2016. Landslide detec...

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

Wightman, Pytorch image models, https://github.com/rwightman/pytorch-image-models, 2019

An integrated approach of machine learning and remote sensing for evaluating landslide hazards and risk hotspots, nw himalaya. Remote Sensing Applications: Society and Environment 33, 101140. Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y ., 2022. Maxvit: Multi-axis vision transformer, in: European conference on computer vi- sion,...

work page doi:10.5281/zenodo.4414861 2022