A Comparative Study of Transformer and Convolutional Models for Crop Segmentation from Satellite Image Time Series
Pith reviewed 2026-05-23 07:42 UTC · model grok-4.3
The pith
TSViT slightly surpasses 3D U-Net for crop segmentation from satellite time series, with VistaFormer offering best efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments on the Munich and Lombardia datasets show that TSViT achieves the best overall results, slightly surpassing 3D U-Net, which remains a strong CNN baseline. VistaFormer offers the best efficiency, while Swin UNETR performs competitively but is less effective than transformers that explicitly model temporal dynamics. These results highlight that temporal modelling is critical for SITS.
What carries the argument
Different strategies for capturing temporal dependencies in transformer architectures versus 3D convolutional networks for processing multispectral time series data.
If this is right
- Temporal modelling is critical for satellite image time series tasks.
- Transformers that explicitly model temporal dynamics outperform those that treat time as an additional spatial dimension.
- TSViT outperforms the tested CNN models on the given datasets.
- VistaFormer provides a strong efficiency-performance trade-off.
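The contrast between "time as an additional spatial dimension" and "explicit temporal modelling" can be pictured at the token level. The following is a minimal sketch under stated assumptions, not a reproduction of any architecture in the paper: the function names, the tubelet size `t`, the patch size `p`, and the toy input sizes are all hypothetical, and the sizes are assumed to divide evenly.

```python
from itertools import product


def joint_3d_tokens(T, H, W, t, p):
    """Time handled like a third spatial axis.

    The series is cut into (t x p x p) tubelets and every tubelet becomes
    one token in a single flat sequence, so time steps inside a tubelet
    are merged before any attention is applied.
    """
    return [
        (ti, hi, wi)
        for ti, hi, wi in product(range(T // t), range(H // p), range(W // p))
    ]


def temporal_first_sequences(T, H, W, p):
    """Explicit temporal modelling (illustrative factorisation).

    Each spatial patch keeps its own sequence of T tokens, one per
    acquisition date, which a temporal encoder could process before any
    mixing across spatial locations.
    """
    return {
        (hi, wi): [(ti, hi, wi) for ti in range(T)]
        for hi, wi in product(range(H // p), range(W // p))
    }


# Toy sizes (hypothetical): 8 dates, 4x4 pixels, 2x2 patches, tubelets of 2 dates.
joint = joint_3d_tokens(T=8, H=4, W=4, t=2, p=2)
per_location = temporal_first_sequences(T=8, H=4, W=4, p=2)

print(len(joint))                 # one flat sequence of tubelet tokens
print(len(per_location))          # number of spatial patches
print(len(per_location[(0, 0)]))  # temporal sequence length per patch
```

In the first layout the temporal order is only one of three interchangeable grid axes; in the second, every spatial location retains a full-length time series, which is the property the bullet above attributes to the better-performing transformers.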
Where Pith is reading between the lines
- The observed superiority of TSViT may not hold for other geographic regions or sensor types not tested here.
- VistaFormer's efficiency could enable deployment on resource-constrained systems for large-area monitoring.
- Extending the comparison to include more recent transformer variants or hybrid models could refine the efficiency-accuracy frontier.
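The "efficiency-accuracy frontier" mentioned above can be made concrete as a Pareto front over (cost, score) pairs. This is a sketch only: the model names and numbers below are placeholders invented for illustration, not results from the paper, and the choice of cost measure (parameters, FLOPs, latency) is left open.

```python
def pareto_front(entries):
    """Return names of entries not dominated by any other entry.

    Each entry is (name, cost, score); lower cost and higher score are
    better. An entry is dominated if another one is at least as good on
    both axes and strictly better on at least one.
    """
    front = []
    for name, cost, score in entries:
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for n, c, s in entries
            if n != name
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values, NOT taken from the paper.
placeholder_results = [
    ("model_a", 10.0, 0.80),
    ("model_b", 40.0, 0.85),
    ("model_c", 50.0, 0.82),  # costlier and less accurate than model_b
    ("model_d", 5.0, 0.70),
]

front = pareto_front(placeholder_results)
print(front)
```

A newly added variant refines the frontier only if it lands in this list; otherwise an existing model already offers equal or better accuracy at equal or lower cost.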
Load-bearing premise
The Munich and Lombardia datasets together with the chosen training and evaluation protocols are representative enough that the observed ranking of models will generalize to other regions, sensors, or crop types.
What would settle it
Evaluating the models on a new Sentinel-2 dataset from a different agricultural region where 3D U-Net or another CNN achieves higher accuracy than TSViT would falsify the claim that TSViT is generally superior.
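Such a test reduces to comparing segmentation metrics computed on the new region. The sketch below assumes overall accuracy and mean IoU as the metrics, since the abstract does not say which ones the paper reports; the confusion matrices are toy placeholders, not measured results.

```python
def overall_accuracy(cm):
    """Fraction of correctly labelled pixels; cm[i][j] counts true class i predicted as j."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """Mean intersection-over-union across classes that occur at least once."""
    n = len(cm)
    ious = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        denom = tp + fp + fn
        if denom:  # skip classes absent from both reference and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy two-class confusion matrices (hypothetical) for two models on a held-out region.
cm_model_a = [[8, 2], [1, 9]]
cm_model_b = [[9, 1], [3, 7]]

for name, cm in [("model_a", cm_model_a), ("model_b", cm_model_b)]:
    print(name, round(overall_accuracy(cm), 3), round(mean_iou(cm), 3))
```

In practice the comparison would also need repeated runs or confidence intervals, because the abstract describes the gap between TSViT and 3D U-Net as slight.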
Original abstract
Crop segmentation from satellite image time series (SITS) is a fundamental task for agricultural monitoring and land-use analysis. While convolutional neural networks (CNNs) have been widely used, transformer-based architectures offer alternative mechanisms for representing spatial and temporal dependencies in multispectral data. This paper presents a comparative study of CNN and transformer-based segmentation models for crop mapping from Sentinel-2 time series, including 3D U-Net, 3D FPN, 3D DeepLabv3, and three transformer architectures: Swin UNETR, TSViT, and VistaFormer, which adopt different strategies for capturing temporal dependencies. Experiments on the Munich and Lombardia datasets show that TSViT achieves the best overall results, slightly surpassing 3D U-Net, which remains a strong CNN baseline. VistaFormer offers the best efficiency, while Swin UNETR performs competitively but is less effective than transformers that explicitly model temporal dynamics. These results highlight that temporal modelling is critical for SITS: TSViT outperforms CNNs and approaches that treat time as an additional spatial dimension, while VistaFormer provides a strong efficiency-performance trade-off.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical comparison of CNN-based (3D U-Net, 3D FPN, 3D DeepLabv3) and transformer-based (Swin UNETR, TSViT, VistaFormer) segmentation models for crop mapping from Sentinel-2 satellite image time series. On the Munich and Lombardia datasets, TSViT achieves the highest overall performance (slightly above the 3D U-Net baseline), VistaFormer offers the best efficiency, and explicit temporal modeling is identified as critical, with approaches treating time as an extra spatial dimension performing less well.
Significance. If the reported ranking is robust, the study supplies actionable guidance for architecture selection in SITS-based agricultural monitoring, particularly the benefit of dedicated temporal mechanisms over 3D convolutions and the efficiency of VistaFormer. The work is a straightforward empirical ranking with no parameter-free derivations or machine-checked proofs.
Major comments (1)
- [Abstract] The central claim that TSViT is best overall and that explicit temporal modeling is critical rests exclusively on results from the Munich and Lombardia datasets. Both are mid-latitude European Sentinel-2 scenes with overlapping crop calendars; no cross-region, cross-sensor, or cross-crop-type experiments are described that would separate architecture effects from data-distribution effects.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the scope of our empirical comparison. We address the major comment below.
- Referee: [Abstract] The central claim that TSViT is best overall and that explicit temporal modeling is critical rests exclusively on results from the Munich and Lombardia datasets. Both are mid-latitude European Sentinel-2 scenes with overlapping crop calendars; no cross-region, cross-sensor, or cross-crop-type experiments are described that would separate architecture effects from data-distribution effects.
Authors: We agree that the reported ranking and the emphasis on explicit temporal modeling are derived solely from the Munich and Lombardia datasets, which share similar mid-latitude European characteristics and crop calendars. The abstract already names these datasets, but we acknowledge that the central claims would benefit from clearer qualification regarding generalizability. In the revised version we will (1) update the abstract to state that the performance ordering holds on these two Sentinel-2 scenes and (2) add a short limitations paragraph noting that architecture effects have not been isolated from data-distribution effects and that cross-region or cross-sensor validation remains future work. We believe the comparative results still offer actionable guidance for similar agricultural monitoring settings.
Revision: yes
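The cross-region validation deferred to future work could be organised as a train/test grid over regions. The sketch below only enumerates the runs; it assumes a hypothetical third dataset, `region_c`, that is not part of the paper, and the function name and record fields are illustrative.

```python
from itertools import product

# Model and dataset names from the abstract; "region_c" is a hypothetical extra region.
MODELS = ["3D U-Net", "3D FPN", "3D DeepLabv3", "Swin UNETR", "TSViT", "VistaFormer"]
REGIONS = ["Munich", "Lombardia", "region_c"]


def cross_region_protocol(models, regions):
    """List every (model, train region, test region) run.

    Runs with train == test give in-distribution scores; the remaining
    runs show how each architecture transfers to data it was not trained on.
    """
    return [
        {
            "model": model,
            "train": train,
            "test": test,
            "in_distribution": train == test,
        }
        for model, train, test in product(models, regions, regions)
    ]


runs = cross_region_protocol(MODELS, REGIONS)
transfer_runs = [r for r in runs if not r["in_distribution"]]
print(len(runs), len(transfer_runs))
```

If the model ranking in the transfer runs matches the in-distribution ranking, the architecture effect is less likely to be an artefact of the two original datasets.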
Circularity Check
No circularity: purely empirical model ranking on fixed datasets
The paper reports an empirical comparison of off-the-shelf CNN and transformer segmentation architectures (3D U-Net, TSViT, VistaFormer, etc.) on the Munich and Lombardia Sentinel-2 datasets. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the abstract or described content. All claims reduce to direct performance metrics on the chosen data splits; the ranking is therefore not forced by construction or by prior self-referential results. This is the normal, non-circular outcome for a benchmark study.