
arxiv: 2412.01944 · v2 · pith:W23ENRLQnew · submitted 2024-12-02 · 💻 cs.CV · eess.IV

A Comparative Study of Transformer and Convolutional Models for Crop Segmentation from Satellite Image Time Series

Pith reviewed 2026-05-23 07:42 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords crop segmentation, satellite image time series, transformer, CNN, Sentinel-2, temporal modeling, semantic segmentation

The pith

TSViT slightly surpasses 3D U-Net for crop segmentation from satellite time series, with VistaFormer offering best efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares convolutional and transformer-based models for segmenting crops in Sentinel-2 satellite image time series. It evaluates three 3D CNN variants and three transformer architectures that handle temporal dependencies differently. TSViT delivers the highest accuracy, narrowly ahead of 3D U-Net, while VistaFormer leads in computational efficiency. On the two datasets tested, these results suggest that how temporal information is modeled matters more for performance than whether the base network is convolutional or transformer-based. Readers interested in remote sensing applications may care because better crop maps improve agricultural monitoring and land-use analysis.

Core claim

Experiments on the Munich and Lombardia datasets show that TSViT achieves the best overall results, slightly surpassing 3D U-Net, which remains a strong CNN baseline. VistaFormer offers the best efficiency, while Swin UNETR performs competitively but is less effective than transformers that explicitly model temporal dynamics. These results highlight that temporal modelling is critical for SITS.

What carries the argument

Different strategies for capturing temporal dependencies in transformer architectures versus 3D convolutional networks for processing multispectral time series data.

If this is right

  • Temporal modelling is critical for satellite image time series tasks.
  • Transformers that explicitly model temporal dynamics outperform those that treat time as an additional spatial dimension.
  • TSViT outperforms the tested CNN models on the given datasets.
  • VistaFormer provides a strong efficiency-performance trade-off.
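
The contrast in the second bullet can be made concrete with a toy sketch. It only shows how pixel indices of a small time series could be grouped into tokens under the two strategies; it is not the tokenization code of Swin UNETR, TSViT, or VistaFormer, and the function names, patch sizes, and grid dimensions are all assumptions chosen for illustration.

```python
def cubic_patches(T, H, W, pt, ph, pw):
    """Time treated as an extra spatial axis: tile (t, y, x) into 3D patches.

    Each token mixes several acquisition dates with their spatial neighbours.
    """
    patches = {}
    for t in range(T):
        for y in range(H):
            for x in range(W):
                key = (t // pt, y // ph, x // pw)
                patches.setdefault(key, []).append((t, y, x))
    return patches


def temporal_sequences(T, H, W, ph, pw):
    """Explicit temporal axis: one ordered sequence of T tokens per spatial patch.

    Each token holds pixels from exactly one acquisition date, so a model can
    relate dates to each other directly along the sequence.
    """
    seqs = {}
    for y0 in range(0, H, ph):
        for x0 in range(0, W, pw):
            seqs[(y0 // ph, x0 // pw)] = [
                [(t, y, x)
                 for y in range(y0, min(y0 + ph, H))
                 for x in range(x0, min(x0 + pw, W))]
                for t in range(T)
            ]
    return seqs


# Hypothetical toy grid: 4 dates, 4x4 pixels, 2x2 spatial patches.
cubes = cubic_patches(T=4, H=4, W=4, pt=2, ph=2, pw=2)
seqs = temporal_sequences(T=4, H=4, W=4, ph=2, pw=2)
dates_in_first_cube = sorted({t for t, _, _ in cubes[(0, 0, 0)]})
dates_in_first_seq = [sorted({t for t, _, _ in tok}) for tok in seqs[(0, 0)]]
```

In the cubic grouping a single token already blends two dates, so the date ordering inside it is no longer visible as a sequence; in the per-patch sequences every date keeps its own position. Whether this difference explains the reported ranking is the paper's hypothesis, not something this sketch demonstrates.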

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed superiority of TSViT may not hold for other geographic regions or sensor types not tested here.
  • VistaFormer's efficiency could enable deployment on resource-constrained systems for large-area monitoring.
  • Extending the comparison to include more recent transformer variants or hybrid models could refine the efficiency-accuracy frontier.

Load-bearing premise

The Munich and Lombardia datasets together with the chosen training and evaluation protocols are representative enough that the observed ranking of models will generalize to other regions, sensors, or crop types.

What would settle it

Evaluating the models on a new Sentinel-2 dataset from a different agricultural region where 3D U-Net or another CNN achieves higher accuracy than TSViT would falsify the claim that TSViT is generally superior.

Figures

Figures reproduced from arXiv: 2412.01944 by Anwar Ur Rehman, Christian Loschiavo, Ignazio Gallo, Mattia Gatti, Mirco Boschetti, Nicola Landro, Riccardo La Grassa.

Figure 1: Proposed adaptation of the Swin UNETR to use with Sentinel 2 time-series.
Figure 2: Three random samples of input-output pairs from the Munich dataset. On the top, the input is shown as an ...
Figure 3: Three random samples of input-output pairs from the Lombardia dataset. On the top, the input is shown as an ...
Figure 4: An example of a good prediction made by the Swin UNETR model on the Munich dataset.
Figure 5: An example of a bad prediction made by the Swin UNETR model on the Munich dataset.
Figure 6: An example of a bad prediction made by the Swin UNETR model on the Lombardia dataset.
Figure 7: An example of a good prediction made by the Swin UNETR model on the Lombardia dataset.
Figure 8: Model’s predictions for Lombardia Test A.

Original abstract

Crop segmentation from satellite image time series (SITS) is a fundamental task for agricultural monitoring and land-use analysis. While convolutional neural networks (CNNs) have been widely used, transformer-based architectures offer alternative mechanisms for representing spatial and temporal dependencies in multispectral data. This paper presents a comparative study of CNN and transformer-based segmentation models for crop mapping from Sentinel-2 time series, including 3D U-Net, 3D FPN, 3D DeepLabv3, and three transformer architectures: Swin UNETR, TSViT, and VistaFormer, which adopt different strategies for capturing temporal dependencies. Experiments on the Munich and Lombardia datasets show that TSViT achieves the best overall results, slightly surpassing 3D U-Net, which remains a strong CNN baseline. VistaFormer offers the best efficiency, while Swin UNETR performs competitively but is less effective than transformers that explicitly model temporal dynamics. These results highlight that temporal modelling is critical for SITS: TSViT outperforms CNNs and approaches that treat time as an additional spatial dimension, while VistaFormer provides a strong efficiency-performance trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents an empirical comparison of CNN-based (3D U-Net, 3D FPN, 3D DeepLabv3) and transformer-based (Swin UNETR, TSViT, VistaFormer) segmentation models for crop mapping from Sentinel-2 satellite image time series. On the Munich and Lombardia datasets, TSViT achieves the highest overall performance (slightly above the 3D U-Net baseline), VistaFormer offers the best efficiency, and explicit temporal modeling is identified as critical, with approaches treating time as an extra spatial dimension performing less well.

Significance. If the reported ranking is robust, the study supplies actionable guidance for architecture selection in SITS-based agricultural monitoring, particularly the benefit of dedicated temporal mechanisms over 3D convolutions and the efficiency of VistaFormer. The work is a straightforward empirical ranking with no parameter-free derivations or machine-checked proofs.

major comments (1)
  1. [Abstract] The central claim that TSViT is best overall and that explicit temporal modeling is critical rests exclusively on results from the Munich and Lombardia datasets. Both are mid-latitude European Sentinel-2 scenes with overlapping crop calendars; no cross-region, cross-sensor, or cross-crop-type experiments are described that would separate architecture effects from data-distribution effects.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the scope of our empirical comparison. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that TSViT is best overall and that explicit temporal modeling is critical rests exclusively on results from the Munich and Lombardia datasets. Both are mid-latitude European Sentinel-2 scenes with overlapping crop calendars; no cross-region, cross-sensor, or cross-crop-type experiments are described that would separate architecture effects from data-distribution effects.

    Authors: We agree that the reported ranking and the emphasis on explicit temporal modeling are derived solely from the Munich and Lombardia datasets, which share similar mid-latitude European characteristics and crop calendars. The abstract already names these datasets, but we acknowledge that the central claims would benefit from clearer qualification regarding generalizability. In the revised version we will (1) update the abstract to state that the performance ordering holds on these two Sentinel-2 scenes and (2) add a short limitations paragraph noting that architecture effects have not been isolated from data-distribution effects and that cross-region or cross-sensor validation remains future work. We believe the comparative results still offer actionable guidance for similar agricultural monitoring settings.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical model ranking on fixed datasets

Full rationale

The paper reports an empirical comparison of off-the-shelf CNN and transformer segmentation architectures (3D U-Net, TSViT, VistaFormer, etc.) on the Munich and Lombardia Sentinel-2 datasets. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the abstract or described content. All claims reduce to direct performance metrics on the chosen data splits; the ranking is therefore not forced by construction or by prior self-referential results. This is the normal, non-circular outcome for a benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical comparative study. No free parameters are fitted as part of a derivation, no mathematical axioms are invoked beyond standard deep-learning assumptions, and no new physical or mathematical entities are postulated.

pith-pipeline@v0.9.0 · 5757 in / 1080 out tokens · 28088 ms · 2026-05-23T07:42:44.895449+00:00 · methodology

