pith. sign in

arxiv: 2601.10931 · v2 · submitted 2026-01-16 · 💻 cs.CV · cs.AI

Sparse Data Tree Canopy Segmentation: Fine-Tuning Leading Pretrained Models on Only 150 Images

Pith reviewed 2026-05-16 14:02 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords tree canopy segmentationaerial imagerysmall dataset fine-tuningpretrained modelsYOLOv11Mask R-CNNvision transformersdata scarcity
0
0 comments X

The pith

Pretrained CNN models like YOLOv11 and Mask R-CNN outperform transformer models when fine-tuned for tree canopy segmentation on just 150 images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests five pretrained models on a small set of 150 aerial images to segment tree canopies, simulating real-world annotation shortages. It shows that convolution-based models, especially YOLOv11 and Mask R-CNN, produce stronger results on unseen data than transformer-based ones such as Swin-UNet and DINOv2. The work matters for environmental tasks because accurate canopy maps support urban planning and ecosystem monitoring without needing large labeled collections. Readers would care since the findings point to practical model choices that avoid heavy overfitting when data is scarce. Experiments also cover training strategies and augmentations that help keep performance stable under these constraints.

Core claim

When fine-tuned on only 150 annotated aerial images for tree canopy segmentation, pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models. DeepLabv3, Swin-UNet, and DINOv2 underperform, owing to differences between semantic and instance segmentation tasks, the high data needs of vision transformers, and the absence of strong inductive biases in transformers.

What carries the argument

Fine-tuning and direct comparison of five architectures—YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2—on the Solafune Tree Canopy Detection dataset of 150 images to measure generalization under extreme data scarcity.

Load-bearing premise

The performance gaps between convolution-based and transformer-based models stem mainly from architectural differences rather than from specific hyperparameter choices, augmentation details, or dataset biases.

What would settle it

A controlled rerun of all five models on the same 150-image splits using identical hyperparameters, augmentation pipelines, and training schedules, then checking whether the accuracy ordering between CNN and transformer groups remains the same.

Figures

Figures reproduced from arXiv: 2601.10931 by Anthony Bertnyk, David Szczecina, Hudson Sun, Kyle Gao, Lincoln Linlin Xu, Niloofar Azad.

Figure 1
Figure 1. Figure 1: Examples of Training Images with Different Resolutions and Scenes and their Segmentation Labels. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative Tree canopy Segmentation Results. Left to Right: Raw image, YOLOv11, Mask-RCNN, DeepLabv3, Swin-UNet, DinoV2 [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
read the original abstract

Tree canopy detection from aerial imagery is an important task for environmental monitoring, urban planning, and ecosystem analysis. Simulating real-life data annotation scarcity, the Solafune Tree Canopy Detection competition provides a small and imbalanced dataset of only 150 annotated images, posing significant challenges for training deep models without severe overfitting. In this work, we evaluate five representative architectures, YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2, to assess their suitability for canopy segmentation under extreme data scarcity. Our experiments show that pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models. DeeplabV3, Swin-UNet and DINOv2 underperform likely due to differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases. These findings confirm that transformer-based architectures struggle in low-data regimes without substantial pretraining or augmentation and that differences between semantic and instance segmentation further affect model performance. We provide a detailed analysis of training strategies, augmentation policies, and model behavior under the small-data constraint and demonstrate that lightweight CNN-based methods remain the most reliable for canopy detection on limited imagery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates five pretrained models—YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2—for tree canopy segmentation on a small, imbalanced dataset of 150 annotated aerial images from the Solafune competition. It claims that convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than transformer-based models, attributing the gaps to stronger inductive biases in CNNs, higher data requirements for Vision Transformers, and differences between instance and semantic segmentation tasks. The work includes analysis of training strategies and augmentation policies under extreme data scarcity.

Significance. If the performance gaps are shown to arise from architecture rather than unequal optimization, the findings would provide practical guidance for selecting models in low-data remote sensing segmentation, confirming that lightweight CNNs remain reliable when annotations are scarce. The emphasis on fine-tuning details for environmental monitoring tasks adds applied value, though the lack of reported metrics limits immediate generalizability.

major comments (2)
  1. [Abstract] Abstract and Experiments section: The central claim that 'pretrained convolution-based models... generalize significantly better' is load-bearing but unsupported by any reported quantitative metrics (e.g., IoU, mAP, Dice scores), error bars, or statistical tests, leaving the magnitude and reliability of the gaps unverified.
  2. [Experimental Setup] Experimental Setup and Analysis sections: The attribution of gaps to 'differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases' requires explicit evidence that all five models received equivalent hyperparameter search budgets, augmentation policies, loss formulations, and optimization effort. Without a table or protocol detailing trials per model, the comparison risks confounding architectural effects with tuning differences.
minor comments (1)
  1. [Abstract] Abstract: Inconsistent model naming ('DeeplabV3' vs. 'DeepLabv3') should be standardized for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and have incorporated revisions to strengthen the quantitative support and experimental transparency.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Experiments section: The central claim that 'pretrained convolution-based models... generalize significantly better' is load-bearing but unsupported by any reported quantitative metrics (e.g., IoU, mAP, Dice scores), error bars, or statistical tests, leaving the magnitude and reliability of the gaps unverified.

    Authors: We agree that explicit quantitative metrics are necessary to substantiate the central claim. In the revised manuscript, we have added a results table in the Experiments section reporting IoU, mAP, and Dice scores for all five models, including error bars from five independent runs with different random seeds and statistical significance tests (paired t-tests) comparing CNN-based versus transformer-based models. These additions directly quantify the performance gaps and will be referenced in the abstract. revision: yes

  2. Referee: [Experimental Setup] Experimental Setup and Analysis sections: The attribution of gaps to 'differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases' requires explicit evidence that all five models received equivalent hyperparameter search budgets, augmentation policies, loss formulations, and optimization effort. Without a table or protocol detailing trials per model, the comparison risks confounding architectural effects with tuning differences.

    Authors: We acknowledge the importance of demonstrating equivalent optimization effort to support architectural attributions. The revised Experimental Setup section now includes a detailed protocol describing the hyperparameter search procedure (grid search over learning rate, batch size, and augmentation strength), the number of trials performed per model (approximately 20–25 configurations each), and the final selected settings. A new table summarizes augmentation policies, loss formulations (e.g., Dice + BCE for semantic models, mask loss for instance models), and optimizer choices, with explicit notes on any task-specific adaptations. This documentation confirms comparable tuning budgets across models. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical model comparison on fixed small dataset

full rationale

The paper conducts a direct experimental comparison of five pretrained architectures fine-tuned on the same 150-image Solafune dataset for canopy segmentation. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. Performance differences are reported from training runs; the conclusion that CNN-based models generalize better is an empirical observation, not a reduction to inputs by construction. The analysis remains self-contained against external benchmarks (held-out test images) with no uniqueness theorems or ansatzes imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper containing no mathematical derivations, free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5546 in / 908 out tokens · 40157 ms · 2026-05-16T14:02:24.558449+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 3 internal anchors

  1. [1]

    Tree canopy detection — compe- tition overview,

    Solafune, Inc., “Tree canopy detection — compe- tition overview,” https://solafune.com/competitions/ 26ff758c-7422-4cd1-bfe0-daecfc40db70?menu=about&tab= overview, 2025, accessed: 2025-10-09

  2. [2]

    Multi-level self-adaptive individual tree detection for coniferous forest using airborne lidar,

    Z. Hui, P. Cheng, B. Yang, and G. Zhou, “Multi-level self-adaptive individual tree detection for coniferous forest using airborne lidar,”International Journal of Applied Earth Observation and Geoinformation, vol. 114, p. 103028,

  3. [3]

    Available: https://www.sciencedirect.com/ science/article/pii/S1569843222002163

    [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S1569843222002163

  4. [4]

    lidr: An r package for analysis of airborne laser scanning (als) data,

    J.-R. Roussel, D. Auty, N. C. Coops, P. Tompalski, T. R. Goodbody, A. S. Meador, J.-F. Bourdon, F. de Boissieu, and A. Achim, “lidr: An r package for analysis of airborne laser scanning (als) data,”Remote Sensing of Environment, vol. 251, p. 112061, 2020. [Online]. Available: https://www. sciencedirect.com/science/article/pii/S0034425720304314

  5. [5]

    3d segmentation of trees through a flexible multiclass graph cut algorithm,

    J. Williams, C.-B. Sch ¨onlieb, T. Swinfield, J. Lee, X. Cai, L. Qie, and D. A. Coomes, “3d segmentation of trees through a flexible multiclass graph cut algorithm,”IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 2, pp. 754–776, 2020

  6. [6]

    Nystr ¨om-based spectral clustering using airborne lidar point cloud data for individual tree segmentation,

    Y . Pang, W. Wang, L. Du, Z. Zhang, X. Liang, Y . Li, and Z. Wang, “Nystr ¨om-based spectral clustering using airborne lidar point cloud data for individual tree segmentation,” International Journal of Digital Earth, vol. 14, no. 10, pp. 1452–1476, 2021. [Online]. Available: https://doi.org/10.1080/ 17538947.2021.1943018

  7. [7]

    Individual tree crown segmentation and crown width extraction from a heightmap derived from aerial laser scanning data using a deep learning framework,

    C. Sun, C. Huang, H. Zhang, B. Chen, F. An, L. Wang, and T. Yun, “Individual tree crown segmentation and crown width extraction from a heightmap derived from aerial laser scanning data using a deep learning framework,” Frontiers in Plant Science, vol. V olume 13 - 2022,

  8. [8]

    Available: https://www.frontiersin.org/journals/ plant-science/articles/10.3389/fpls.2022.914974

    [Online]. Available: https://www.frontiersin.org/journals/ plant-science/articles/10.3389/fpls.2022.914974

  9. [9]

    Individual rubber tree segmentation based on ground-based lidar data and faster r-cnn of deep learning,

    J. Wang, X. Chen, L. Cao, F. An, B. Chen, L. Xue, and T. Yun, “Individual rubber tree segmentation based on ground-based lidar data and faster r-cnn of deep learning,”F orests, vol. 10, no. 9, 2019. [Online]. Available: https://www.mdpi.com/1999-4907/10/9/793

  10. [10]

    Implementing deep learning algorithms for urban tree detection and geolocation with high-resolution aerial, satellite, and ground-level images,

    L. Velasquez-Camacho, M. Etxegarai, and S. de Miguel, “Implementing deep learning algorithms for urban tree detection and geolocation with high-resolution aerial, satellite, and ground-level images,”Computers, Environment and Urban Systems, vol. 105, p. 102025, 2023. [Online]. Available: https:// www.sciencedirect.com/science/article/pii/S0198971523000881

  11. [11]

    Very high resolution canopy height maps from rgb imagery using self-supervised vision transformer and convolutional decoder trained on aerial lidar,

    J. Tolan, H.-I. Yang, B. Nosarzewski, G. Couairon, H. V . V o, J. Brandt, J. Spore, S. Majumdar, D. Haziza, J. Vamaraju, T. Moutakanni, P. Bojanowski, T. Johns, B. White, T. Tiecke, and C. Couprie, “Very high resolution canopy height maps from rgb imagery using self-supervised vision transformer and convolutional decoder trained on aerial lidar,”Remote Se...

  12. [12]

    Comparison of vision transformers and convolutional neural networks in medical image analysis: A systematic review,

    S. Takahashi, Y . Sakaguchi, N. Kouno, K. Takasawa, K. Ishizu, Y . Akagi, R. Aoyama, N. Teraya, N. Shinkai, H. Machino, K. Kobayashi, K. Asada, M. Komatsu, S. Kaneko, M. Sugiyama, and R. Hamamoto, “Comparison of vision transformers and convolutional neural networks in medical image analysis: A systematic review,”Journal of Medical Systems, vol. 48, p. 84, 09 2024

  13. [13]

    Ten deep learning techniques to address small data problems with remote sensing,

    A. Safonova, G. Ghazaryan, S. Stiller, M. Main-Knorn, C. Nendel, and M. Ryo, “Ten deep learning techniques to address small data problems with remote sensing,”International Journal of Applied Earth Observation and Geoinformation, vol. 125, p. 103569, 2023. [Online]. Available: https://www. sciencedirect.com/science/article/pii/S156984322300393X

  14. [14]

    Ultralytics yolo11,

    G. Jocher and J. Qiu, “Ultralytics yolo11,” 2024. [Online]. Available: https://github.com/ultralytics/ultralytics

  15. [15]

    Mask r-cnn,

    K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick, “Mask r-cnn,”

  16. [16]

    Mask R-CNN

    [Online]. Available: https://arxiv.org/abs/1703.06870

  17. [17]

    Re- thinking atrous convolution for semantic image segmentation,

    L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Re- thinking atrous convolution for semantic image segmentation,”

  18. [18]

    Rethinking Atrous Convolution for Semantic Image Segmentation

    [Online]. Available: https://arxiv.org/abs/1706.05587

  19. [19]

    Dinov2: Learning robust visual features without supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supe...

  20. [20]

    DINOv2: Learning Robust Visual Features without Supervision

    [Online]. Available: https://arxiv.org/abs/2304.07193

  21. [21]

    Swin-unet: Unet-like pure transformer for medical image segmentation,

    H. Cao, Y . Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and Y . Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,”arXiv preprint arXiv:2105.05537, 2021