Multimodal Urban Tree Detection from Satellite and Street-Level Imagery via Annotation-Efficient Deep Learning Strategies

Ali Moghimi; In Seon Kim

arXiv: 2604.03505 · v1 · submitted 2026-04-03 · 💻 cs.CV

Pith reviewed 2026-05-13 19:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords urban tree detection · multimodal imagery · domain adaptation · active learning · semi-supervised learning · satellite imagery · street view

The pith

Hybrid learning on satellite and street-level images detects urban trees at 0.90 F1-score with a 12 percent gain over baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a multimodal framework that first uses satellite imagery to locate candidate trees and then retrieves targeted street-level views for detailed verification. It tackles high annotation costs through domain adaptation from an existing labeled set to a new region, then compares semi-supervised learning, active learning, and their hybrid combination inside a transformer detector. The hybrid approach reached the highest accuracy while lowering both missed trees and false detections. This setup matters because accurate, low-cost tree maps enable better environmental monitoring, disaster recovery, and urban planning without relying on exhaustive manual surveys.

Core claim

The authors demonstrate that a multimodal pipeline achieves an F1-score of 0.90 by combining three components: satellite imagery for tree candidate localization followed by ground-level Google Street View images, domain adaptation to transfer knowledge from a source dataset, and a hybrid semi-supervised plus active learning strategy on a transformer-based model. This is a 12 percent improvement over the baseline model. In contrast, pure semi-supervised learning degrades progressively due to confirmation bias in pseudo-labeling, while active learning improves steadily through targeted human labeling of uncertain or incorrect predictions.
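The summary does not say whether the 12 percent is a relative gain or a gain in absolute F1 points, and the baseline F1 is not stated. The sketch below only works through the arithmetic for each reading; the baseline values it prints are implied by the respective assumption and are not figures reported by the paper.

```python
# Two possible readings of "F1 = 0.90, a 12% improvement over the baseline".
# Both baseline values are derived here under an assumption; neither is
# reported in the paper.

HYBRID_F1 = 0.90
GAIN = 0.12


def baseline_if_relative(final_f1: float, gain: float) -> float:
    """Baseline b such that b * (1 + gain) == final_f1."""
    return final_f1 / (1.0 + gain)


def baseline_if_absolute(final_f1: float, gain: float) -> float:
    """Baseline b such that b + gain == final_f1 (gain in F1 points)."""
    return final_f1 - gain


if __name__ == "__main__":
    rel = baseline_if_relative(HYBRID_F1, GAIN)
    ab = baseline_if_absolute(HYBRID_F1, GAIN)
    print(f"relative reading: implied baseline F1 = {rel:.3f}")
    print(f"absolute reading: implied baseline F1 = {ab:.3f}")
```

The two readings imply baselines that differ by roughly two F1 points, which is one reason the missing baseline value matters for interpreting the claim.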

What carries the argument

Multimodal candidate localization from satellite imagery followed by targeted street-view verification, powered by domain adaptation and a hybrid active-semi-supervised learning loop on a transformer detector.

Load-bearing premise

Domain adaptation successfully bridges the gap between the source annotated dataset and the target region without introducing significant biases or performance drops.
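One way this premise could be probed is with a distribution-shift statistic such as maximum mean discrepancy (MMD) between source-region and target-region feature vectors. The following is a minimal standard-library sketch of a biased squared-MMD estimate with an RBF kernel. The feature vectors, the `gamma` value, and the idea of taking features from the detector backbone are all assumptions for illustration; the paper does not describe computing this quantity.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(source, target, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples.

    `source` and `target` are non-empty lists of feature vectors. Larger
    values suggest a larger gap between the two feature distributions.
    """
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


if __name__ == "__main__":
    # Toy 2-D "features"; real use would take embeddings from a trained model.
    source = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
    near = [[0.05, 0.05], [0.0, 0.05], [0.1, 0.1]]
    far = [[3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]
    print("source vs near:", mmd_squared(source, near))
    print("source vs far: ", mmd_squared(source, far))
```

Comparing this statistic before and after adaptation, alongside target-domain F1, would give a direct reading on whether the gap actually narrowed.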

What would settle it

Running the hybrid model on a new urban region with no further active-learning labels and measuring whether the F1-score stays at or above 0.90 or falls below the original baseline.

Figures

Figures reproduced from arXiv: 2604.03505 by Ali Moghimi, In Seon Kim.

**Figure 2.** Flowcharts for the three annotation-efficient learning strategies. (A) Flowchart for the semi-supervised pipeline. The model is trained on the labeled train/validation dataset and deployed on an unlabeled pool. Detections with confidence above 0.8 are automatically accepted as pseudo-labels and merged with the original dataset for retraining, while lower-confidence predictions are placed back in the unlabe…

**Figure 3.** Precision, Recall, and F1-Score curve of the satellite model canopy

**Figure 4.** Performance comparison of three learning strategies over 10 rounds: Semi-Supervised learning (SS), Active Learning (AL), and hybrid AL

**Figure 5.** Error analysis and performance metric dynamics across the three

**Figure 6.** Error analysis of the tree detection model. (A) Multiple overlapping

**Figure 7.** Distribution of the number of pseudo-labeled and human-annotated samples added to the training dataset per round for di

Original abstract

Beyond the immediate biophysical benefits, urban trees play a foundational role in environmental sustainability and disaster mitigation. Precise mapping of urban trees is essential for environmental monitoring, post-disaster assessment, and strengthening policy. However, the transition from traditional, labor-intensive field surveys to scalable automated systems remains limited by high annotation costs and poor generalization across diverse urban scenarios. This study introduces a multimodal framework that integrates high-resolution satellite imagery with ground-level Google Street View to enable scalable and detailed urban tree detection under limited-annotation conditions. The framework first leverages satellite imagery to localize tree candidates and then retrieves targeted ground-level views for detailed detection, significantly reducing inefficient street-level sampling. To address the annotation bottleneck, domain adaptation is used to transfer knowledge from an existing annotated dataset to a new region of interest. To further minimize human effort, we evaluated three learning strategies: semi-supervised learning, active learning, and a hybrid approach combining both, using a transformer-based detection model. The hybrid strategy achieved the best performance with an F1-score of 0.90, representing a 12% improvement over the baseline model. In contrast, semi-supervised learning exhibited progressive performance degradation due to confirmation bias in pseudo-labeling, while active learning steadily improved results through targeted human intervention to label uncertain or incorrect predictions. Error analysis further showed that active and hybrid strategies reduced both false positives and false negatives. Our findings highlight the importance of a multimodal approach and guided annotation for scalable, annotation-efficient urban tree mapping to strengthen sustainable city planning.
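The hybrid strategy described in the abstract can be pictured as a routing step applied to each round of detections. The sketch below is a hedged illustration, not the authors' implementation: the 0.8 threshold comes from the Figure 2 caption for the semi-supervised branch, while the `Detection` class, the `human_budget` parameter, and the least-confident-first ordering of the review queue are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Threshold quoted in the Figure 2 caption for accepting pseudo-labels.
PSEUDO_LABEL_THRESHOLD = 0.8


@dataclass(frozen=True)
class Detection:
    """Hypothetical container for one predicted tree box."""
    image_id: str
    confidence: float


def route_detections(
    detections: List[Detection],
    threshold: float = PSEUDO_LABEL_THRESHOLD,
    human_budget: Optional[int] = None,
) -> Tuple[List[Detection], List[Detection], List[Detection]]:
    """Split detections into pseudo-labels, a human review queue, and leftovers.

    Assumption: the review queue takes the least confident detections first,
    capped by `human_budget`; the paper's selection rule may differ.
    """
    pseudo_labels = [d for d in detections if d.confidence > threshold]
    uncertain = sorted(
        (d for d in detections if d.confidence <= threshold),
        key=lambda d: d.confidence,
    )
    cutoff = len(uncertain) if human_budget is None else max(0, human_budget)
    return pseudo_labels, uncertain[:cutoff], uncertain[cutoff:]


if __name__ == "__main__":
    dets = [Detection("a", 0.95), Detection("b", 0.30), Detection("c", 0.60)]
    accepted, review, pool = route_detections(dets, human_budget=1)
    print(len(accepted), "pseudo-labels,", len(review), "for review,", len(pool), "back to pool")
```

Under this reading, pure semi-supervised learning corresponds to `human_budget=0`, which is the setting where confirmation bias has no human correction.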

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper shows a workable multimodal pipeline for urban tree detection that cuts annotation costs via satellite pre-filtering plus hybrid learning, but the 12% gain claim rests on unverified domain adaptation with no supporting ablations or shift metrics.

The core contribution here is a concrete pipeline that first uses satellite imagery to find candidate tree locations, then pulls targeted street-view images for finer detection. That step alone reduces wasteful sampling. They layer on domain adaptation from an existing labeled set, then compare semi-supervised, active, and hybrid learning on a transformer detector. The hybrid version hits F1 0.90 and beats the baseline by 12 percent, while pure semi-supervised degrades from confirmation bias in the pseudo-labels. Active learning and the hybrid both cut false positives and negatives in the error analysis. That bias observation is useful and worth noting for anyone using pseudo-labeling in this domain.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a multimodal framework for urban tree detection that first uses high-resolution satellite imagery to localize candidate trees and then retrieves targeted Google Street View images for detailed detection. Domain adaptation transfers knowledge from an existing annotated source dataset to a new target region, while three annotation-efficient strategies (semi-supervised learning, active learning, and their hybrid) are evaluated with a transformer-based detector. The central empirical claim is that the hybrid strategy attains an F1-score of 0.90, a 12% improvement over the baseline, accompanied by reduced false positives and false negatives.

Significance. If the reported gains are shown to be robust and directly attributable to the hybrid strategy, the work would provide a practical, scalable route to annotation-light urban tree mapping with clear relevance to environmental monitoring, disaster assessment, and sustainable city planning. The multimodal fusion and guided annotation components address real deployment bottlenecks in remote-sensing computer vision.

major comments (2)

[Abstract and Results] The headline claim of F1 = 0.90 with a 12% improvement is presented without the numerical baseline F1 value, without the number of experimental runs or standard deviations, and without any statistical significance test (e.g., paired t-test p-value). These omissions prevent verification that the observed gain is reliable and attributable to the hybrid strategy rather than to unstated experimental choices.
[Methods and Results] No quantitative evidence is supplied for the success of domain adaptation (e.g., before/after F1 on the target domain, maximum mean discrepancy, or pseudo-label accuracy on held-out target samples). Without such diagnostics or an ablation that isolates the adaptation step from the semi-supervised and active-learning components, the central attribution of performance gains to the proposed hybrid pipeline cannot be substantiated.
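To make the first request concrete, here is a minimal standard-library sketch of the paired t statistic over per-run F1 scores. It assumes the baseline and hybrid models are trained with matched seeds so that runs pair up. The function name and the numbers in the demo are invented for illustration and are not results from the paper. The sketch stops at the t statistic and degrees of freedom; a p-value would need a t-distribution CDF, for example from a statistics library.

```python
import math
import statistics


def paired_t_statistic(baseline_scores, treated_scores):
    """Return (t, degrees_of_freedom) for a paired t-test on per-run scores."""
    if len(baseline_scores) != len(treated_scores):
        raise ValueError("paired samples must have the same length")
    diffs = [t - b for b, t in zip(baseline_scores, treated_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


if __name__ == "__main__":
    # Placeholder per-run F1 values, invented for illustration only.
    placeholder_baseline = [0.70, 0.72, 0.71, 0.69, 0.73]
    placeholder_hybrid = [0.80, 0.79, 0.83, 0.78, 0.82]
    t, dof = paired_t_statistic(placeholder_baseline, placeholder_hybrid)
    print(f"t = {t:.2f} with {dof} degrees of freedom")
```

Reporting the per-run scores alongside the statistic would also supply the standard deviations the comment asks for.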

minor comments (2)

[Abstract] The abstract states that semi-supervised learning exhibited 'progressive performance degradation' but supplies no iteration-wise F1 curve or confirmation-bias metric to quantify the effect.
[Results] Error-analysis discussion would be strengthened by a table or figure that directly compares false-positive and false-negative rates across all four conditions (baseline, semi-supervised, active, hybrid).
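The comparison requested in the last comment could be produced from raw detection counts. The sketch below assumes true-positive, false-positive, and false-negative counts are available per condition. The helper names and the demo counts are hypothetical, with generic condition labels standing in for baseline, semi-supervised, active, and hybrid.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from detection counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def format_error_table(counts):
    """Render {condition: (tp, fp, fn)} as a fixed-width text table."""
    rows = [f"{'condition':<16}{'FP':>6}{'FN':>6}{'P':>8}{'R':>8}{'F1':>8}"]
    for name, (tp, fp, fn) in counts.items():
        p, r, f1 = detection_metrics(tp, fp, fn)
        rows.append(f"{name:<16}{fp:>6}{fn:>6}{p:>8.3f}{r:>8.3f}{f1:>8.3f}")
    return "\n".join(rows)


if __name__ == "__main__":
    # Made-up counts for illustration only; not taken from the paper.
    placeholder_counts = {
        "condition_a": (80, 20, 20),
        "condition_b": (90, 10, 10),
    }
    print(format_error_table(placeholder_counts))
```

Showing the false-positive and false-negative counts next to precision and recall makes it visible which error type drives a change in F1.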

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the manuscript to strengthen the reporting of results and the substantiation of the domain adaptation component.

Referee: [Abstract and Results] The headline claim of F1 = 0.90 with a 12% improvement is presented without the numerical baseline F1 value, without the number of experimental runs or standard deviations, and without any statistical significance test (e.g., paired t-test p-value). These omissions prevent verification that the observed gain is reliable and attributable to the hybrid strategy rather than to unstated experimental choices.

Authors: We agree that the current presentation lacks sufficient detail for independent verification. In the revised manuscript we will explicitly state the numerical baseline F1 value, report results averaged over multiple independent runs with standard deviations, and include a paired t-test (or equivalent) with p-value to confirm that the observed improvement is statistically significant and attributable to the hybrid strategy rather than experimental variability. revision: yes

Referee: [Methods and Results] No quantitative evidence is supplied for the success of domain adaptation (e.g., before/after F1 on the target domain, maximum mean discrepancy, or pseudo-label accuracy on held-out target samples). Without such diagnostics or an ablation that isolates the adaptation step from the semi-supervised and active-learning components, the central attribution of performance gains to the proposed hybrid pipeline cannot be substantiated.

Authors: We acknowledge the value of explicit diagnostics for the domain adaptation step. We will add before/after F1 scores on the target domain and include a dedicated ablation study that isolates the contribution of domain adaptation from the semi-supervised and active-learning components. These additions will be placed in the Methods and Results sections to directly substantiate the attribution of gains to the overall hybrid pipeline. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on direct empirical evaluation

The paper reports experimental F1-scores (hybrid strategy reaching 0.90, +12% over baseline) obtained from applying domain adaptation, semi-supervised learning, active learning, and a hybrid combination to a transformer-based detector on satellite and Street View imagery. No equations, parameter fits, or predictions are presented that reduce to the inputs by construction; performance numbers are measured outcomes on held-out data rather than self-definitional or self-cited derivations. No load-bearing self-citations or uniqueness theorems are invoked to justify core results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of domain adaptation for cross-region transfer and the superiority of hybrid annotation strategies; no free parameters or new entities are introduced beyond standard deep-learning components.

axioms (1)

domain assumption: Domain adaptation can effectively transfer knowledge from an existing annotated dataset to a new region of interest.
Invoked to address the annotation bottleneck when moving to a new urban area.

pith-pipeline@v0.9.0 · 5570 in / 1359 out tokens · 65607 ms · 2026-05-13T19:28:33.948346+00:00 · methodology

