A generalised pre-training strategy for deep learning networks in semantic segmentation of remotely sensed images
Pith reviewed 2026-05-07 07:04 UTC · model grok-4.3
The pith
A pre-training strategy on ImageNet guides models to better generalize to semantic segmentation of remotely sensed images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying a generalised pre-training strategy on ImageNet, deep learning models can be guided away from learning domain-specific features, leading to improved generalization when fine-tuned for semantic segmentation on remotely sensed images. This results in state-of-the-art performance on the iSAID (67.4% mIoU), MFNet (56.9% mIoU), PST900 (84.22% mIoU), and Potsdam (91.88% mF1) datasets.
What carries the argument
The generalised pre-training strategy, which steers the model away from domain-specific features of the pre-training dataset during pre-training.
If this is right
- Pre-trained models become more adaptable to remote sensing datasets with varying scenes and modalities without custom pre-training data.
- Significant reduction in effort needed to create domain-specific pre-training datasets for remote sensing applications.
- Potential for developing a single foundation model that works across computer vision and remote sensing domains.
- Consistent accuracy gains demonstrated across multiple independent datasets.
Where Pith is reading between the lines
- Applying the same strategy to other pre-training datasets or domains with large gaps could yield similar benefits.
- Once the concrete implementation details are known, the approach could be tested on additional segmentation benchmarks.
- Future work might explore whether the gains come from reduced overfitting to natural image statistics or from some other regularization effect.
Load-bearing premise
The proposed pre-training strategy can guide models away from learning domain-specific features even though the exact mechanism and implementation details are not described.
What would settle it
Running controlled experiments that compare the proposed strategy against standard ImageNet pre-training on the same models and datasets and finding no improvement in segmentation accuracy would falsify the central claim.
Original abstract
In the segmentation of remotely sensed images, deep learning models are typically pre-trained using large image databases like ImageNet before fine-tuned on domain-specific datasets. However, the performance of these fine-tuned models is often hindered by the large domain gaps (i.e., differences in scenes and modalities) between ImageNet's images and remotely sensed images being processed. Therefore, many researchers have undertaken efforts to establish large-scale domain-specific image datasets for pre-training, aiming to enhance model performance. However, establishing such datasets is often challenging, requiring significant effort, and these datasets often exhibit limited generalisability to other application scenarios. To address these issues, this study introduces a novel yet simple pre-training strategy designed to guide a model away from learning domain-specific features in a pre-training dataset during pre-training, thereby improving the generalisation ability of the pre-trained model. To evaluate the strategy's effectiveness, deep learning models are pre-trained on ImageNet and subsequently fine-tuned on four semantic segmentation datasets with diverse scenes and modalities, including iSAID, MFNet, PST900 and Potsdam. Experimental results show that the proposed pre-training strategy led to state-of-the-art accuracies on all four datasets, namely 67.4% mIoU for iSAID, 56.9% mIoU for MFNet, 84.22% mIoU for PST900, 91.88% mF1 for Potsdam. This research lays the groundwork for developing a unified foundation model applicable to both computer vision and remote sensing applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a novel yet simple pre-training strategy for deep learning models used in semantic segmentation of remotely sensed images. The strategy is intended to guide models pre-trained on ImageNet away from learning domain-specific features, thereby improving generalization when fine-tuned on remote sensing datasets. The authors evaluate the approach on four datasets (iSAID, MFNet, PST900, and Potsdam) and report state-of-the-art results: 67.4% mIoU on iSAID, 56.9% mIoU on MFNet, 84.22% mIoU on PST900, and 91.88% mF1 on Potsdam. The work aims to avoid the need for large domain-specific pre-training datasets while still leveraging ImageNet.
Significance. If the central claim holds after proper validation, the result would be significant for the field. It addresses a practical challenge in remote sensing where domain gaps between natural images and overhead imagery limit transfer learning, and it offers a potential path toward more generalizable models without the cost of curating new large-scale domain-specific datasets. This could support development of unified foundation models spanning computer vision and remote sensing. No machine-checked proofs, reproducible code releases, or parameter-free derivations are described in the manuscript.
major comments (3)
- [Abstract and Method section] The central claim rests on a 'novel yet simple pre-training strategy' that guides the model away from learning domain-specific (ImageNet) features. However, no equations, loss functions, pseudocode, auxiliary losses, feature regularization terms, or implementation details are provided to describe the concrete mechanism. Without this, it is impossible to determine whether the reported gains arise from the claimed strategy or from unmentioned factors such as training schedule, augmentations, or model variants.
- [Experimental Results section] The manuscript reports SOTA mIoU/mF1 numbers on four datasets but supplies no baseline comparisons to prior SOTA methods, no ablation studies isolating the effect of the pre-training strategy, no error bars, and no statistical significance tests. This makes it impossible to attribute the improvements (e.g., 67.4% mIoU on iSAID) specifically to the proposed strategy rather than other experimental choices.
- [Abstract] The claim that the strategy 'improves the generalisation ability of the pre-trained model' is load-bearing for the entire narrative, yet the text provides no quantitative evidence (such as feature visualization, domain discrepancy metrics, or comparison of learned representations) that the model is indeed guided away from ImageNet-specific features.
minor comments (2)
- [Abstract] The Potsdam result is reported in mF1 while the others use mIoU; clarify the choice of metric and whether it is consistent with prior work on that dataset.
- [Introduction] The discussion of 'large domain gaps' would benefit from specific references to prior domain-adaptation or transfer-learning studies in remote sensing to better situate the contribution.
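To make the metric question concrete: both mIoU and mF1 are class-averaged scores computed from the same confusion matrix, and per class F1 = 2·IoU/(1+IoU), so F1 is always the higher number. The sketch below is illustrative only; the exact evaluation protocols of iSAID, MFNet, PST900 and Potsdam (e.g., ignore labels, eroded boundaries on Potsdam) may differ.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix
    (rows = ground truth, columns = prediction) from flat label maps."""
    mask = (y_true >= 0) & (y_true < num_classes)
    idx = num_classes * y_true[mask] + y_pred[mask]
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_mf1(cm):
    """Per-class IoU = TP / (TP + FP + FN); per-class F1 = 2TP / (2TP + FP + FN).
    mIoU and mF1 are their unweighted means over classes."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1e-12)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)
    return iou.mean(), f1.mean()

# Toy example: a 2-class prediction on a tiny flattened label map.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
cm = confusion_matrix(y_true, y_pred, num_classes=2)
miou, mf1 = miou_and_mf1(cm)  # here 0.5 mIoU vs 0.667 mF1 on identical predictions
```

The gap between the two numbers in the toy example shows why the Potsdam mF1 of 91.88% is not directly comparable to the mIoU figures on the other datasets.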
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments. We address each of the major comments below and commit to making the necessary revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract and Method section] The central claim rests on a 'novel yet simple pre-training strategy' that guides the model away from learning domain-specific (ImageNet) features. However, no equations, loss functions, pseudocode, auxiliary losses, feature regularization terms, or implementation details are provided to describe the concrete mechanism. Without this, it is impossible to determine whether the reported gains arise from the claimed strategy or from unmentioned factors such as training schedule, augmentations, or model variants.
Authors: We agree that the current description of the pre-training strategy in the Method section is high-level and lacks formal details. The strategy is designed to minimize the learning of ImageNet-specific features by incorporating specific training protocols during pre-training, but we recognize the need for explicit specification. In the revised manuscript, we will add a detailed algorithmic description, pseudocode, and any auxiliary components or implementation specifics to allow full reproducibility and to clarify how the gains are achieved. revision: yes
Referee: [Experimental Results section] The manuscript reports SOTA mIoU/mF1 numbers on four datasets but supplies no baseline comparisons to prior SOTA methods, no ablation studies isolating the effect of the pre-training strategy, no error bars, and no statistical significance tests. This makes it impossible to attribute the improvements (e.g., 67.4% mIoU on iSAID) specifically to the proposed strategy rather than other experimental choices.
Authors: We acknowledge the importance of rigorous experimental validation. The current manuscript focuses on reporting the achieved SOTA results but does not include the requested comparisons and analyses. We will revise the Experimental Results section to include direct comparisons with prior state-of-the-art methods on each dataset, ablation studies that isolate the contribution of the pre-training strategy (e.g., with and without the strategy), results with standard error bars from multiple random seeds, and appropriate statistical tests to demonstrate significance. revision: yes
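The multi-seed analysis the authors commit to could be as simple as the following sketch: report mean ± std per condition and a bootstrap confidence interval on the paired per-seed difference. The scores below are hypothetical placeholders, not results from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-seed mIoU scores (percent) for baseline ImageNet
# pre-training vs. the proposed strategy; real values would come from
# repeated fine-tuning runs with different random seeds.
baseline = np.array([66.1, 66.5, 65.9, 66.3, 66.0])
proposed = np.array([67.2, 67.6, 67.1, 67.5, 67.0])

diff = proposed - baseline  # paired differences, one per seed

# Mean and sample standard deviation per condition (the "error bars").
summary = {
    "baseline": (baseline.mean(), baseline.std(ddof=1)),
    "proposed": (proposed.mean(), proposed.std(ddof=1)),
}

# Bootstrap 95% confidence interval for the mean paired difference:
# if the interval excludes 0, the gain is unlikely to be seed noise.
boot = np.array([
    rng.choice(diff, size=diff.size, replace=True).mean()
    for _ in range(10_000)
])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
```

With only a handful of seeds a paired test (bootstrap or t-test) on the per-seed differences is far more sensitive than comparing the two unpaired means.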
Referee: [Abstract] The claim that the strategy 'improves the generalisation ability of the pre-trained model' is load-bearing for the entire narrative, yet the text provides no quantitative evidence (such as feature visualization, domain discrepancy metrics, or comparison of learned representations) that the model is indeed guided away from ImageNet-specific features.
Authors: The manuscript's abstract and results emphasize the performance improvements as evidence of better generalization. However, we agree that direct evidence supporting the mechanism—i.e., that the model learns fewer domain-specific features—is currently absent. In the revision, we will add analyses such as feature visualizations (e.g., t-SNE embeddings of features extracted from ImageNet and remote sensing images), quantitative domain discrepancy measures (like Maximum Mean Discrepancy), and comparisons of representation similarities to substantiate the claim that the strategy guides the model away from ImageNet-specific features. revision: yes
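The Maximum Mean Discrepancy the authors mention admits a very compact (biased, V-statistic) estimator; a minimal NumPy sketch, with synthetic stand-in features rather than anything from the paper, is:

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of squared Maximum Mean Discrepancy
    between samples X and Y under the RBF kernel
    k(a, b) = exp(-||a - b||^2 / (2 sigma^2)). Always >= 0."""
    def kernel(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

rng = np.random.default_rng(0)
# Hypothetical pooled feature vectors from two domains; identically
# distributed features should give MMD^2 near 0, shifted ones a larger value.
feats_a = rng.normal(0.0, 1.0, size=(200, 8))
feats_b = rng.normal(0.0, 1.0, size=(200, 8))  # same distribution as feats_a
feats_c = rng.normal(2.0, 1.0, size=(200, 8))  # mean-shifted distribution
mmd_same = rbf_mmd2(feats_a, feats_b)
mmd_shift = rbf_mmd2(feats_a, feats_c)
```

Applied to pooled encoder features from ImageNet and remote-sensing images, a lower MMD under the proposed pre-training than under standard pre-training would directly support the claimed mechanism; the kernel bandwidth sigma is typically set by the median pairwise distance.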
Circularity Check
No circularity: purely empirical claims with no derivation chain
Full rationale
The paper asserts a novel pre-training strategy that 'guides a model away from learning domain-specific features' during ImageNet pre-training, then reports empirical SOTA results (67.4% mIoU on iSAID, etc.) after fine-tuning on four remote-sensing datasets. No equations, loss functions, uniqueness theorems, fitted parameters, or self-citations are invoked as a deductive chain. The central narrative is an experimental comparison whose validity rests on the reported accuracies and implementation details (which the abstract leaves undescribed), not on any reduction of a 'prediction' to its own inputs. This is a standard empirical ML paper; the absence of a mathematical derivation means there is nothing that can be circular by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard assumptions in deep learning pre-training and fine-tuning for transfer learning hold.
Reference graph
Works this paper leans on
- [1] Aryal, J., Sitaula, C., Aryal, S.: NDVI Threshold-Based Urban Green Space Mapping from Sentinel-2A at the Local Governmental Area (LGA) Level of Victoria, Australia. Land (Basel). 11, 351 (2022). https://doi.org/10.3390/land11030351
- [2] Zhu, Q., Cai, Y., Fang, Y., Yang, Y., Chen, C., Fan, L., Nguyen, A.: Samba: Semantic segmentation of remotely sensed images with state space model. Heliyon. 10, e38495 (2024). https://doi.org/10.1016/j.heliyon.2024.e38495
- [3] Zhu, Q., Fang, Y., Cai, Y., Chen, C., Fan, L.: Rethinking Scanning Strategies with Vision Mamba in Semantic Segmentation of Remote Sensing Imagery: An Experimental Study. IEEE J Sel Top Appl Earth Obs Remote Sens. 1–14 (2024). https://doi.org/10.1109/JSTARS.2024.3472296
- [4] Rajbhandari, S., Aryal, J., Osborn, J., Lucieer, A., Musk, R.: Leveraging Machine Learning to Extend Ontology-Driven Geographic Object-Based Image Analysis (O-GEOBIA): A Case Study in Forest-Type Mapping. Remote Sens (Basel). 11, 503 (2019). https://doi.org/10.3390/rs11050503
- [5] Cai, Y., Huang, H., Wang, K., Zhang, C., Fan, L., Guo, F.: Selecting Optimal Combination of Data Channels for Semantic Segmentation in City Information Modelling (CIM). Remote Sens (Basel). 13, 1367 (2021). https://doi.org/10.3390/rs13071367
- [6] Cai, Y., Fan, L., Atkinson, P.M., Zhang, C.: Semantic Segmentation of Terrestrial Laser Scanning Point Clouds Using Locally Enhanced Image-Based Geometric Representations. IEEE Transactions on Geoscience and Remote Sensing. 60, 1–15 (2022). https://doi.org/10.1109/TGRS.2022.3161982
- [7] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. Int J Comput Vis. 115, 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
- [8] Hao, J., Chen, S.: Language-aware multiple datasets detection pretraining for DETRs. Neural Networks. 179, 106506 (2024). https://doi.org/10.1016/j.neunet.2024.106506
- [9] Shirmard, H., Farahbakhsh, E., Müller, R.D., Chandra, R.: A review of machine learning in processing remote sensing data for mineral exploration. Remote Sens Environ. 268, 112750 (2022). https://doi.org/10.1016/j.rse.2021.112750
- [10] Fang, Y., Cai, Y., Fan, L.: SDRCNN: A Single-Scale Dense Residual Connected Convolutional Neural Network for Pansharpening. IEEE J Sel Top Appl Earth Obs Remote Sens. 16, 6325–6338 (2023). https://doi.org/10.1109/JSTARS.2023.3292320
- [11] Sun, X., Wang, P., Lu, W., Zhu, Z., Lu, X., He, Q., Li, J., Rong, X., Yang, Z., Chang, H., He, Q., Yang, G., Wang, R., Lu, J., Fu, K.: RingMo: A Remote Sensing Foundation Model With Masked Image Modeling. IEEE Transactions on Geoscience and Remote Sensing. 61, 1–22 (2023). https://doi.org/10.1109/TGRS.2022.3194732
- [12] Jing, L., Tian, Y.: Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey. IEEE Trans Pattern Anal Mach Intell. 43, 4037–4058 (2021). https://doi.org/10.1109/TPAMI.2020.2992393
- [13] Cai, Y., Aryal, J., Fang, Y., Huang, H., Fan, L.: OSTA: One-shot Task-adaptive Channel Selection for Semantic Segmentation of Multichannel Images. (2023)
- [14] Waqas Zamir, S., Arora, A., Gupta, A., Khan, S., Sun, G., Shahbaz Khan, F., Zhu, F., Shao, L., Xia, G.-S., Bai, X.: iSAID: A Large-scale Dataset for Instance Segmentation in Aerial Images. (2019)
- [15] Ha, Q., Watanabe, K., Karasawa, T., Ushiku, Y., Harada, T.: MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 5108–5115. IEEE (2017). https://doi.org/10.1109/IROS.2017.8206396
- [16] Shivakumar, S.S., Rodrigues, N., Zhou, A., Miller, I.D., Kumar, V., Taylor, C.J.: PST900: RGB-Thermal Calibration, Dataset and Segmentation Network. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). pp. 9441–9447. IEEE (2020). https://doi.org/10.1109/ICRA40945.2020.9196831
- [17] Cai, Y., Fan, L., Fang, Y.: SBSS: Stacking-Based Semantic Segmentation Framework for Very High-Resolution Remote Sensing Image. IEEE Transactions on Geoscience and Remote Sensing. 61, 1–14 (2023). https://doi.org/10.1109/TGRS.2023.3234549
- [18] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9992–10002. IEEE (2021). https://doi.org/10.1109/ICCV48922.2021.00986
- [19] Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11966–11976. IEEE (2022). https://doi.org/10.1109/CVPR52688.2022.01167
- [20] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified Perceptual Parsing for Scene Understanding. (2018)
- [21] Muhtar, D., Zhang, X., Xiao, P., Li, Z., Gu, F.: CMID: A Unified Self-Supervised Learning Framework for Remote Sensing Image Understanding. IEEE Transactions on Geoscience and Remote Sensing. 61, 1–17 (2023). https://doi.org/10.1109/TGRS.2023.3268232
- [22] Zhu, Q., Fan, L., Weng, N.: Advancements in point cloud data augmentation for deep learning: A survey. Pattern Recognit. 153, 110532 (2024). https://doi.org/10.1016/j.patcog.2024.110532
- [23] Wang, D., Zhang, J., Du, B., Xia, G.-S., Tao, D.: An Empirical Study of Remote Sensing Pretraining. IEEE Transactions on Geoscience and Remote Sensing. 61, 1–20 (2023). https://doi.org/10.1109/TGRS.2022.3176603
- [24] Cai, Y., Fan, L., Zhang, C.: Semantic Segmentation of Multispectral Images via Linear Compression of Bands: An Experiment Using RIT-18. Remote Sens (Basel). 14, 2673 (2022). https://doi.org/10.3390/rs14112673