pith. machine review for the scientific record.

arxiv: 2604.13722 · v1 · submitted 2026-04-15 · 💻 cs.CV

Recognition: unknown

Granularity-Aware Transfer for Tree Instance Segmentation in Synthetic and Real Forests


Pith reviewed 2026-05-10 13:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords instance segmentation · domain transfer · synthetic data · forestry · granularity · distillation · tree detection

The pith

Granularity-aware distillation transfers fine-grained synthetic tree annotations to improve segmentation on real coarse-labeled forest images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles synthetic-to-real transfer for tree instance segmentation in a setting where synthetic data offers fine-grained trunk and crown labels but real data carries only coarse whole-tree labels. It introduces MGTD, a mixed-granularity dataset with 53k synthetic and 3.6k real images, and a four-stage protocol that separates domain shift from granularity mismatch. The key method, granularity-aware distillation, merges logits from fine-grained synthetic teachers and unifies their masks to pass structural priors to a student trained on coarse labels, yielding higher mask average precision, especially for small and distant trees. The result matters because it shows how detailed synthetic data can compensate for limited real annotations in practical forestry applications.
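
The four-stage protocol is only sketched in the abstract; the phase names visible in the figure captions (Phases 1–3 are named there) suggest the sequence below. A minimal sketch with phase roles inferred, not quoted from the paper:

```python
# Hedged sketch of the four-stage protocol; phase roles are inferred from
# the figure captions, not quoted from the paper itself.
PROTOCOL = [
    ("phase_1", "train teachers on synthetic data with fine trunk/crown labels"),
    ("phase_2", "test synthetic-trained models directly on real images to expose the domain gap"),
    ("phase_3", "train a baseline directly on real coarse tree labels"),
    ("phase_4", "granularity-aware distillation from synthetic teachers to the coarse-label student"),
]

def summarize(protocol):
    """Render the protocol as one line per phase."""
    return "\n".join(f"{name}: {role}" for name, role in protocol)
```

Reading the phases this way, Phases 2 and 3 act as controls that isolate domain shift and label coarseness respectively, so any Phase 4 gain can be attributed to the distillation step.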

Core claim

The authors establish that granularity-aware distillation, which performs logit-space merging and mask unification to transfer structural priors from fine-grained synthetic teachers to coarse-label students, yields consistent improvements in mask AP on real forest images despite domain shift and label coarseness.

What carries the argument

Granularity-aware distillation via logit-space merging and mask unification to align fine synthetic priors with coarse real labels.
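
The abstract does not spell out the merge operator or the distillation objective, so the following is a sketch under stated assumptions: a log-sum-exp merge of per-pixel trunk and crown logits, mask union for unification, and an L2 distillation penalty. All function names are hypothetical.

```python
import numpy as np

def merge_fine_logits(trunk_logits, crown_logits):
    # Log-sum-exp merge of per-pixel trunk and crown teacher logits into a
    # single coarse "tree" logit map; one plausible realization of
    # logit-space merging, not the paper's confirmed operator.
    stacked = np.stack([trunk_logits, crown_logits])
    m = stacked.max(axis=0)
    return m + np.log(np.exp(stacked - m).sum(axis=0))

def unify_masks(trunk_mask, crown_mask):
    # Union of binary trunk/crown instance masks into one whole-tree mask,
    # matching the coarse "Tree" label granularity of the real data.
    return np.logical_or(trunk_mask, crown_mask)

def distill_loss(student_logit, merged_teacher_logit):
    # Simple L2 distillation penalty between the student's coarse tree
    # logit and the merged teacher logit (the paper may instead use a
    # KL-divergence or mask-level objective).
    return float(np.mean((student_logit - merged_teacher_logit) ** 2))
```

Log-sum-exp is a natural choice here because it behaves like a soft maximum: a pixel confidently labeled as either trunk or crown by a teacher stays confidently "tree" after merging.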

If this is right

  • Consistent gains in mask AP for tree instance segmentation on real data.
  • Particular benefits for detecting small and distant trees.
  • Provides an isolated testbed for studying granularity mismatch in sim-to-real transfer.
  • Enables better use of synthetic data in scenarios with limited real labeling resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method might extend to other segmentation tasks with hierarchical or multi-level labels, such as in urban scene parsing.
  • Combining this with other domain adaptation techniques could further reduce the performance gap.
  • If more detailed real labels become available through semi-supervised means, they could be integrated into the unification step for additional gains.

Load-bearing premise

Structural priors learned from fine-grained synthetic annotations about tree trunks and crowns remain transferable and beneficial even when the target real labels are coarse and the images come from a different domain.

What would settle it

Train a model solely on the real coarse labels and compare its mask AP to the distilled model's on the same real test set; if the distilled model shows no gain, or degrades, the claim is falsified.
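
That comparison can be sketched with a lightweight proxy for the mask-AP protocol: mask IoU, greedy matching at an IoU threshold, and a paired per-image gain. These helpers are illustrative, not the paper's evaluation code; full mask AP averages over IoU thresholds and recall levels, which this omits.

```python
import numpy as np

def mask_iou(a, b):
    # Intersection-over-union of two boolean instance masks.
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter / union) if union else 0.0

def matched_recall(pred_masks, gt_masks, thr=0.5):
    # Greedy one-to-one matching of predictions to ground truth at an IoU
    # threshold; recall here stands in for the full mask-AP computation.
    matched = set()
    for p in pred_masks:
        for i, g in enumerate(gt_masks):
            if i not in matched and mask_iou(p, g) >= thr:
                matched.add(i)
                break
    return len(matched) / len(gt_masks) if gt_masks else 0.0

def distillation_gain(baseline_scores, distilled_scores):
    # Paired per-image comparison on the same real test set; a
    # non-positive mean gain would falsify the distillation claim.
    return float(np.mean(np.asarray(distilled_scores) - np.asarray(baseline_scores)))
```

The paired design matters: both models must be scored on identical real test images so the gain reflects the distillation step rather than test-set variation.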

Figures

Figures reproduced from arXiv: 2604.13722 by Anandatirtha JS, Anmol Ashri, Atef Tej, Karsten Berns, Pankaj Deoli.

Figure 1: Phase 1: Instance segmentation results on synthetic validation data. Top row …
Figure 2: Phase 2: Domain transfer. Comparison of predictions obtained from …
Figure 3: Qualitative comparison of real-only training (Phase 3) and granularity-aware distillation …
Figure 1: Sample images from the MGTD dataset. Includes both the real and simulated …
Figure 2: Phase 1: Instance segmentation results on Tree Trunk examples. Each row shows the RGB image, model prediction, and ground truth mask.
Figure 3: Phase 1: Instance segmentation results on Whole tree examples. Each row shows the RGB image, model prediction, and ground truth mask.
Figure 4: Phase 2: Domain transfer (Tree Trunk → Real Trees).
Figure 5: Phase 2: Domain transfer (Whole Tree → Real Trees).
Figure 6: Phase 3: Instance segmentation (Mask R-CNN with Swin-T backbone) results on real trees (trained directly on real data). Each row shows the RGB image, prediction, and ground truth.
Figure 7: Phase 1: Qualitative analysis of yolov11m (trained on simulated tree trunks) on the simulated tree trunks val set.
Figure 8: Phase 1: Qualitative analysis of yolov8m (trained on simulated whole trees) on the simulated whole trees val set. The predictions align closely with the ground truth annotations, capturing both near- and far-field trees. Occasional errors occur in cases of severe occlusion or trees with very thin stems, where detections are sometimes fragmented …
Figure 9: Phase 2: Domain gap when the best simulated models (tree trunk and whole tree) were directly tested on real tree images. Qualitative examples further highlight these quantitative differences: predictions on real images reveal that the trunk-trained model often detects only a limited subset of trees, focusing on highly salient or well-lit stems while missing thinner or background trunks, particularly under he…
read the original abstract

We address the challenge of synthetic-to-real transfer in forestry perception where real data have only coarse Tree labels while synthetic data provide fine-grained trunk/crown annotations. We introduce MGTD, a mixed-granularity dataset with 53k synthetic and 3.6k real images, and a four-stage protocol isolating domain shift and granularity mismatch. Our core contribution is granularity-aware distillation, which transfers structural priors from fine-grained synthetic teachers to a coarse-label student via logit-space merging and mask unification. Experiments show consistent mask AP gains, especially for small/distant trees, establishing a testbed for Sim-Real transfer under label granularity constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces MGTD, a mixed-granularity dataset with 53k synthetic images providing fine-grained trunk/crown annotations and 3.6k real images with coarse tree labels. It proposes a four-stage protocol that isolates domain shift from granularity mismatch, along with granularity-aware distillation that transfers structural priors via logit-space merging and mask unification from a fine-grained synthetic teacher to a coarse-label student. Experiments report consistent mask AP gains, with particular benefits for small and distant trees.

Significance. If the reported gains prove robust, the work provides a practical testbed and technique for sim-to-real transfer in forestry instance segmentation under realistic label-granularity constraints. The isolation of factors in the protocol and the emphasis on small/distant trees align with application needs in forest inventory and perception.

minor comments (3)
  1. Abstract: The four-stage protocol and logit-space merging/mask unification steps are described at a high level; a diagram or pseudocode in §3 would clarify how fine-grained priors survive the unification without introducing label-induced bias.
  2. Abstract: No numerical AP values, baseline comparisons, or statistical tests are mentioned; the full experiments section should include these to substantiate the 'consistent gains' claim.
  3. Abstract: Consider spelling out MGTD on first use and confirming whether the dataset will be released publicly, as it is positioned as a core contribution.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. We appreciate the recognition that the MGTD dataset and four-stage protocol provide a practical testbed for isolating domain shift from granularity mismatch, and that granularity-aware distillation offers a useful technique for transferring structural priors to coarse real labels, with benefits for small and distant trees.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical method consisting of a new mixed-granularity dataset and a four-stage transfer protocol that applies standard distillation and domain-adaptation techniques to tree instance segmentation. No mathematical derivations, first-principles predictions, or equations are described in the provided text. The central claims rest on reported experimental mask AP improvements rather than any reduction of outputs to fitted inputs or self-referential definitions by construction. No load-bearing self-citations or ansatz smuggling are visible; the approach is self-contained against external benchmarks and does not invoke uniqueness theorems or prior author results to force its conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that synthetic fine-grained annotations encode transferable structural priors and that logit merging plus mask unification can bridge granularity without introducing new biases.

pith-pipeline@v0.9.0 · 5413 in / 1142 out tokens · 28805 ms · 2026-05-10T13:52:10.424947+00:00 · methodology


Reference graph

Works this paper leans on

21 extracted references · 8 canonical work pages · 1 internal anchor

  1. Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
  2. Deng, J., Li, W., Chen, Y., Duan, L.: Unbiased mean teacher for cross-domain object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4091–4101 (2021)
  3. Deoli, P., Deshpande, S.A., Vierling, A., Berns, K.: Exploring image fusion techniques for off-road semantic segmentation in harsh lighting conditions: a multispectral imagery analysis. In: 2024 21st International Conference on Ubiquitous Robots (UR). pp. 566–573 (2024). https://doi.org/10.1109/UR61395.2024.10597528
  4. Dubey, A., Gupta, O., Raskar, R., Naik, N.: Maximum entropy fine-grained classification. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. pp. 635–645. NIPS'18, Curran Associates Inc., Red Hook, NY, USA (2018)
  5. Feng, Z., She, Y., Keshav, S.: Spread: A large-scale, high-fidelity synthetic dataset for multiple forest vision tasks. Ecological Informatics 87, 103085 (2025)
  6. Grondin, V., Pomerleau, F., Giguère, P.: Training deep learning algorithms on synthetic forest images for tree detection. In: ICRA 2022 Workshop in Innovation in Forestry Robotics: Research and Industry Adoption (2022)
  7. Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021)
  8. Jiang, P., Osteen, P., Wigness, M., Saripalli, S.: RELLIS-3D dataset: Data, benchmarks and analysis. In: 2021 IEEE International Conference on Robotics and Automation (ICRA). pp. 1110–1116. IEEE (2021)
  9. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)
  10. Lagos, J., Lempiö, U., Rahtu, E.: FinnWoodlands dataset. In: Scandinavian Conference on Image Analysis. pp. 95–110. Springer (2023)
  11. Li, R., Sun, G., Wang, S., Tan, T., Xu, F.: Tree trunk detection in urban scenes using a multiscale attention-based deep learning method. Ecological Informatics 77, 102215 (2023). https://doi.org/10.1016/j.ecoinf.2023.102215
  12. Puliti, S., Pearse, G., Surový, P., Wallace, L., Hollaus, M., Wielgosz, M., Astrup, R.: FOR-instance: a UAV laser scanning benchmark dataset for semantic and instance segmentation of individual trees. arXiv preprint arXiv:2309.01279 (2023)
  13. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6), 1137–1149 (2016)
  14. Steininger, D., Simon, J., Trondl, A., Murschitz, M.: TimberVision: A multi-task dataset and framework for log-component segmentation and tracking in autonomous forestry operations. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 5601–5610 (2025). https://doi.org/10.1109/WACV61041.2025.00547
  15. Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems 30 (2017)
  16. Tranheden, W., Olsson, V., Pinto, J., Svensson, L.: DACS: Domain adaptation via cross-domain mixed sampling. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1379–1389 (2021)
  17. Tsai, Y.H., Hung, W.C., Schulter, S., Sohn, K., Yang, M.H., Chandraker, M.: Learning to adapt structured output space for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
  18. Weinstein, B., Marconi, S., Zare, A., Bohlman, S., Graves, S., Singh, A., White, E.: NEON tree crowns dataset (2020)
  19. Wigness, M., Eum, S., Rogers, J.G., Han, D., Kwon, H.: A RUGD dataset for autonomous navigation and visual perception in unstructured outdoor environments. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 5000–5007. IEEE (2019)
  20. Zhao, B., Feng, J., Wu, X., Yan, S.: A survey on deep learning-based fine-grained object classification and semantic segmentation. International Journal of Automation and Computing 14 (01 2017). https://doi.org/10.1007/s11633-017-1053-3
  21. Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. In: ECCV (2022)