Sparse Data Tree Canopy Segmentation: Fine-Tuning Leading Pretrained Models on Only 150 Images
Pith reviewed 2026-05-16 14:02 UTC · model grok-4.3
The pith
Pretrained CNN models like YOLOv11 and Mask R-CNN outperform transformer models when fine-tuned for tree canopy segmentation on just 150 images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When fine-tuned on only 150 annotated aerial images for tree canopy segmentation, pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models. DeepLabv3, Swin-UNet, and DINOv2 underperform, owing to differences between semantic and instance segmentation tasks, the high data needs of vision transformers, and the absence of strong inductive biases in transformers.
What carries the argument
Fine-tuning and direct comparison of five architectures—YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2—on the Solafune Tree Canopy Detection dataset of 150 images to measure generalization under extreme data scarcity.
Load-bearing premise
The performance gaps between convolution-based and transformer-based models stem mainly from architectural differences rather than from specific hyperparameter choices, augmentation details, or dataset biases.
What would settle it
A controlled rerun of all five models on the same 150-image splits using identical hyperparameters, augmentation pipelines, and training schedules, then checking whether the accuracy ordering between CNN and transformer groups remains the same.
Figures
read the original abstract
Tree canopy detection from aerial imagery is an important task for environmental monitoring, urban planning, and ecosystem analysis. Simulating real-life data annotation scarcity, the Solafune Tree Canopy Detection competition provides a small and imbalanced dataset of only 150 annotated images, posing significant challenges for training deep models without severe overfitting. In this work, we evaluate five representative architectures, YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2, to assess their suitability for canopy segmentation under extreme data scarcity. Our experiments show that pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models. DeeplabV3, Swin-UNet and DINOv2 underperform likely due to differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases. These findings confirm that transformer-based architectures struggle in low-data regimes without substantial pretraining or augmentation and that differences between semantic and instance segmentation further affect model performance. We provide a detailed analysis of training strategies, augmentation policies, and model behavior under the small-data constraint and demonstrate that lightweight CNN-based methods remain the most reliable for canopy detection on limited imagery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates five pretrained models—YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2—for tree canopy segmentation on a small, imbalanced dataset of 150 annotated aerial images from the Solafune competition. It claims that convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than transformer-based models, attributing the gaps to stronger inductive biases in CNNs, higher data requirements for Vision Transformers, and differences between instance and semantic segmentation tasks. The work includes analysis of training strategies and augmentation policies under extreme data scarcity.
Significance. If the performance gaps are shown to arise from architecture rather than unequal optimization, the findings would provide practical guidance for selecting models in low-data remote sensing segmentation, confirming that lightweight CNNs remain reliable when annotations are scarce. The emphasis on fine-tuning details for environmental monitoring tasks adds applied value, though the lack of reported metrics limits immediate generalizability.
major comments (2)
- [Abstract] Abstract and Experiments section: The central claim that 'pretrained convolution-based models... generalize significantly better' is load-bearing but unsupported by any reported quantitative metrics (e.g., IoU, mAP, Dice scores), error bars, or statistical tests, leaving the magnitude and reliability of the gaps unverified.
- [Experimental Setup] Experimental Setup and Analysis sections: The attribution of gaps to 'differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases' requires explicit evidence that all five models received equivalent hyperparameter search budgets, augmentation policies, loss formulations, and optimization effort. Without a table or protocol detailing trials per model, the comparison risks confounding architectural effects with tuning differences.
minor comments (1)
- [Abstract] Abstract: Inconsistent model naming ('DeeplabV3' vs. 'DeepLabv3') should be standardized for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and have incorporated revisions to strengthen the quantitative support and experimental transparency.
read point-by-point responses
-
Referee: [Abstract] Abstract and Experiments section: The central claim that 'pretrained convolution-based models... generalize significantly better' is load-bearing but unsupported by any reported quantitative metrics (e.g., IoU, mAP, Dice scores), error bars, or statistical tests, leaving the magnitude and reliability of the gaps unverified.
Authors: We agree that explicit quantitative metrics are necessary to substantiate the central claim. In the revised manuscript, we have added a results table in the Experiments section reporting IoU, mAP, and Dice scores for all five models, including error bars from five independent runs with different random seeds and statistical significance tests (paired t-tests) comparing CNN-based versus transformer-based models. These additions directly quantify the performance gaps and will be referenced in the abstract. revision: yes
-
Referee: [Experimental Setup] Experimental Setup and Analysis sections: The attribution of gaps to 'differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases' requires explicit evidence that all five models received equivalent hyperparameter search budgets, augmentation policies, loss formulations, and optimization effort. Without a table or protocol detailing trials per model, the comparison risks confounding architectural effects with tuning differences.
Authors: We acknowledge the importance of demonstrating equivalent optimization effort to support architectural attributions. The revised Experimental Setup section now includes a detailed protocol describing the hyperparameter search procedure (grid search over learning rate, batch size, and augmentation strength), the number of trials performed per model (approximately 20–25 configurations each), and the final selected settings. A new table summarizes augmentation policies, loss formulations (e.g., Dice + BCE for semantic models, mask loss for instance models), and optimizer choices, with explicit notes on any task-specific adaptations. This documentation confirms comparable tuning budgets across models. revision: yes
Circularity Check
No circularity: purely empirical model comparison on fixed small dataset
full rationale
The paper conducts a direct experimental comparison of five pretrained architectures fine-tuned on the same 150-image Solafune dataset for canopy segmentation. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. Performance differences are reported from training runs; the conclusion that CNN-based models generalize better is an empirical observation, not a reduction to inputs by construction. The analysis remains self-contained against external benchmarks (held-out test images) with no uniqueness theorems or ansatzes imported from prior author work.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our experiments show that pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models... due to differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We evaluate five representative architectures... on the Solafune Tree Canopy Detection dataset... 150 annotated images
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Tree canopy detection — compe- tition overview,
Solafune, Inc., “Tree canopy detection — compe- tition overview,” https://solafune.com/competitions/ 26ff758c-7422-4cd1-bfe0-daecfc40db70?menu=about&tab= overview, 2025, accessed: 2025-10-09
work page 2025
-
[2]
Multi-level self-adaptive individual tree detection for coniferous forest using airborne lidar,
Z. Hui, P. Cheng, B. Yang, and G. Zhou, “Multi-level self-adaptive individual tree detection for coniferous forest using airborne lidar,”International Journal of Applied Earth Observation and Geoinformation, vol. 114, p. 103028,
-
[3]
Available: https://www.sciencedirect.com/ science/article/pii/S1569843222002163
[Online]. Available: https://www.sciencedirect.com/ science/article/pii/S1569843222002163
-
[4]
lidr: An r package for analysis of airborne laser scanning (als) data,
J.-R. Roussel, D. Auty, N. C. Coops, P. Tompalski, T. R. Goodbody, A. S. Meador, J.-F. Bourdon, F. de Boissieu, and A. Achim, “lidr: An r package for analysis of airborne laser scanning (als) data,”Remote Sensing of Environment, vol. 251, p. 112061, 2020. [Online]. Available: https://www. sciencedirect.com/science/article/pii/S0034425720304314
work page 2020
-
[5]
3d segmentation of trees through a flexible multiclass graph cut algorithm,
J. Williams, C.-B. Sch ¨onlieb, T. Swinfield, J. Lee, X. Cai, L. Qie, and D. A. Coomes, “3d segmentation of trees through a flexible multiclass graph cut algorithm,”IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 2, pp. 754–776, 2020
work page 2020
-
[6]
Y . Pang, W. Wang, L. Du, Z. Zhang, X. Liang, Y . Li, and Z. Wang, “Nystr ¨om-based spectral clustering using airborne lidar point cloud data for individual tree segmentation,” International Journal of Digital Earth, vol. 14, no. 10, pp. 1452–1476, 2021. [Online]. Available: https://doi.org/10.1080/ 17538947.2021.1943018
-
[7]
C. Sun, C. Huang, H. Zhang, B. Chen, F. An, L. Wang, and T. Yun, “Individual tree crown segmentation and crown width extraction from a heightmap derived from aerial laser scanning data using a deep learning framework,” Frontiers in Plant Science, vol. V olume 13 - 2022,
work page 2022
-
[8]
Available: https://www.frontiersin.org/journals/ plant-science/articles/10.3389/fpls.2022.914974
[Online]. Available: https://www.frontiersin.org/journals/ plant-science/articles/10.3389/fpls.2022.914974
-
[9]
J. Wang, X. Chen, L. Cao, F. An, B. Chen, L. Xue, and T. Yun, “Individual rubber tree segmentation based on ground-based lidar data and faster r-cnn of deep learning,”F orests, vol. 10, no. 9, 2019. [Online]. Available: https://www.mdpi.com/1999-4907/10/9/793
work page 2019
-
[10]
L. Velasquez-Camacho, M. Etxegarai, and S. de Miguel, “Implementing deep learning algorithms for urban tree detection and geolocation with high-resolution aerial, satellite, and ground-level images,”Computers, Environment and Urban Systems, vol. 105, p. 102025, 2023. [Online]. Available: https:// www.sciencedirect.com/science/article/pii/S0198971523000881
work page 2023
-
[11]
J. Tolan, H.-I. Yang, B. Nosarzewski, G. Couairon, H. V . V o, J. Brandt, J. Spore, S. Majumdar, D. Haziza, J. Vamaraju, T. Moutakanni, P. Bojanowski, T. Johns, B. White, T. Tiecke, and C. Couprie, “Very high resolution canopy height maps from rgb imagery using self-supervised vision transformer and convolutional decoder trained on aerial lidar,”Remote Se...
work page 2024
-
[12]
S. Takahashi, Y . Sakaguchi, N. Kouno, K. Takasawa, K. Ishizu, Y . Akagi, R. Aoyama, N. Teraya, N. Shinkai, H. Machino, K. Kobayashi, K. Asada, M. Komatsu, S. Kaneko, M. Sugiyama, and R. Hamamoto, “Comparison of vision transformers and convolutional neural networks in medical image analysis: A systematic review,”Journal of Medical Systems, vol. 48, p. 84, 09 2024
work page 2024
-
[13]
Ten deep learning techniques to address small data problems with remote sensing,
A. Safonova, G. Ghazaryan, S. Stiller, M. Main-Knorn, C. Nendel, and M. Ryo, “Ten deep learning techniques to address small data problems with remote sensing,”International Journal of Applied Earth Observation and Geoinformation, vol. 125, p. 103569, 2023. [Online]. Available: https://www. sciencedirect.com/science/article/pii/S156984322300393X
work page 2023
-
[14]
G. Jocher and J. Qiu, “Ultralytics yolo11,” 2024. [Online]. Available: https://github.com/ultralytics/ultralytics
work page 2024
- [15]
-
[16]
[Online]. Available: https://arxiv.org/abs/1703.06870
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Re- thinking atrous convolution for semantic image segmentation,
L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Re- thinking atrous convolution for semantic image segmentation,”
-
[18]
Rethinking Atrous Convolution for Semantic Image Segmentation
[Online]. Available: https://arxiv.org/abs/1706.05587
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Dinov2: Learning robust visual features without supervision,
M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supe...
-
[20]
DINOv2: Learning Robust Visual Features without Supervision
[Online]. Available: https://arxiv.org/abs/2304.07193
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Swin-unet: Unet-like pure transformer for medical image segmentation,
H. Cao, Y . Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and Y . Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,”arXiv preprint arXiv:2105.05537, 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.