Multimodal Object Detection Under Sparse Forest-Canopy Occlusion

Mangal Kothari; Nitik Jain

arxiv: 2605.15326 · v1 · pith:EKAA2X2Bnew · submitted 2026-05-14 · 💻 cs.CV

Pith reviewed 2026-05-19 16:03 UTC · model grok-4.3

keywords: multimodal detection · forest canopy occlusion · visible-thermal fusion · airborne optical sectioning · LiDAR penetration · YOLOv5 · search and rescue · synthetic aperture imaging

The pith

Multimodal fusion of thermal-visible imagery and airborne optical sectioning improves human detection under sparse forest canopy where LiDAR penetration proves limited.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates three complementary sensing methods for spotting humans hidden by forest vegetation. It tests how well terrestrial LiDAR penetrates canopy, fuses visible and thermal images to raise target contrast, and uses synthetic-aperture imaging (Airborne Optical Sectioning) to suppress canopy clutter. A YOLOv5 detector fine-tuned on the Teledyne FLIR thermal dataset reaches a mean average precision of roughly 0.83 on that dataset's top three classes. The paper presents these steps as an initial baseline for UAV systems that must operate in real wooded terrain.

Core claim

A multimodal pipeline that evaluates LiDAR returns through vegetation, applies multi-scale transform and sparse-representation fusion to visible-thermal pairs, and forms synthetic-aperture images via Airborne Optical Sectioning can raise human saliency and ground-plane visibility in occluded forest scenes, yielding a fine-tuned YOLOv5 mean average precision of approximately 0.83 on the top three classes of the Teledyne FLIR thermal dataset.
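To make the fusion step in this claim concrete, here is a minimal sketch of a two-scale visible-thermal fusion. It is a simplification, not the paper's MST-SR method: a box blur stands in for the multi-scale transform, and plain averaging of the base layers stands in for the sparse-representation rule. The function names `box_blur` and `fuse_two_scale` and the toy images are hypothetical.

```python
import numpy as np

def box_blur(img, radius=2):
    """Separable mean filter with edge padding (the low-pass stage)."""
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    padded = np.pad(img, radius, mode="edge")
    # Filter along rows, then along columns; 'valid' strips the padding again.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def fuse_two_scale(visible, thermal, radius=2):
    """Fuse two registered grayscale images of identical shape."""
    vis = visible.astype(np.float64)
    thr = thermal.astype(np.float64)
    # Split each image into a smooth base layer and a detail layer.
    base_v, base_t = box_blur(vis, radius), box_blur(thr, radius)
    detail_v, detail_t = vis - base_v, thr - base_t
    # Assumption: average the base layers (the paper uses sparse representation here).
    fused_base = 0.5 * (base_v + base_t)
    # Keep whichever detail coefficient has the larger magnitude.
    fused_detail = np.where(np.abs(detail_v) >= np.abs(detail_t), detail_v, detail_t)
    return fused_base + fused_detail

# Toy pair: the visible image shows no target, the thermal image shows a warm patch.
visible = np.full((32, 32), 0.5)
thermal = np.full((32, 32), 0.2)
thermal[12:20, 12:20] = 1.0
fused = fuse_two_scale(visible, thermal)
```

In this toy pair the visible image has zero contrast at the target location, while the fused image inherits the thermal contrast. Real inputs would first need pixel-level registration between the two cameras, which the sketch assumes has already been done.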

What carries the argument

Multimodal proof-of-concept pipeline that pairs LiDAR penetration assessment with visible-thermal fusion and Airborne Optical Sectioning to suppress canopy clutter and enhance object saliency.

If this is right

Visible-thermal fusion raises target visibility in low-contrast forest scenes.
Airborne Optical Sectioning reduces canopy clutter and improves ground-plane detection on synthetic imagery.
The tested terrestrial LiDAR configuration shows limited penetration at object-detection scales.
Fine-tuned YOLOv5 reaches mean average precision near 0.83 on the strongest FLIR thermal classes.
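The clutter-suppression effect attributed to Airborne Optical Sectioning above can be illustrated with a shift-and-average sketch. It rests on strong assumptions that are not taken from the paper: the camera translates along one axis, registration to the ground plane is a whole-pixel horizontal shift, and `np.roll` wraps at the image border. The names `simulate_views` and `aos_integral`, and all scene parameters, are hypothetical.

```python
import numpy as np

def simulate_views(ground, n_views=9, ground_step=1, canopy_step=4,
                   density=0.5, occluder_value=0.0, seed=0):
    """Render toy views of a ground plane seen through a random canopy layer.

    The canopy is closer to the camera, so it shifts more per view than the ground.
    """
    rng = np.random.default_rng(seed)
    canopy = rng.random(ground.shape) < density  # True where a leaf blocks the ray
    views, ground_shifts = [], []
    for i in range(n_views):
        g_shift, c_shift = i * ground_step, i * canopy_step
        view = np.roll(ground, g_shift, axis=1)
        mask = np.roll(canopy, c_shift, axis=1)
        views.append(np.where(mask, occluder_value, view))
        ground_shifts.append(g_shift)
    return views, ground_shifts

def aos_integral(views, ground_shifts):
    """Register every view to the ground (focal) plane and average them."""
    registered = [np.roll(v, -s, axis=1) for v, s in zip(views, ground_shifts)]
    return np.mean(registered, axis=0)

# Toy scene: dim ground with one bright target patch.
ground = np.full((32, 64), 0.2)
ground[12:20, 28:36] = 1.0
views, shifts = simulate_views(ground)
integral = aos_integral(views, shifts)
```

After registration the target lands on the same pixels in every view, while each leaf lands somewhere different, so averaging blurs the canopy out and leaves the target as a graded but contiguous patch. A real system would project each image onto a terrain model using measured camera poses rather than applying integer shifts.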

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion and sectioning steps could be adapted to other partially occluded settings such as urban foliage or post-disaster rubble.
Real-time onboard processing of the three modalities together would enable autonomous UAV search routes that do not rely on clear line-of-sight.
Collecting a dedicated forest-specific dataset with ground-truth labels would allow retraining to close the gap between synthetic and field performance.
Adding a fourth modality such as hyperspectral sensing might further separate vegetation signatures from human signatures.

Load-bearing premise

Results obtained on the Teledyne FLIR thermal dataset and on synthetic forest imagery will translate directly to real-world UAV or ground-based captures in actual sparse forest-canopy occlusion.

What would settle it

A controlled field experiment that flies a UAV over a real sparse forest, places human targets at varying depths under canopy, records simultaneous LiDAR, visible, and thermal streams, and measures whether the reported fusion and AOS gains appear in the actual detection rates.

Figures

Figures reproduced from arXiv: 2605.15326 by Mangal Kothari, Nitik Jain.

**Figure 2.** Remote-sensing sensor classes and operational regimes considered when selecting …

**Figure 3.** CAT S60 smartphone used for proof-of-concept RGB and thermal image acquisition.

**Figure 4.** Airborne Optical Sectioning principle: multiple images captured over a synthetic …

**Figure 5.** LiDAR experiment with target partially concealed behind ground-level vegetation.

**Figure 6.** Point-cloud observation from the first LiDAR experiment. The target structure is not …

**Figure 7.** Elevated/top-down LiDAR observation through canopy. The scanner primarily registers …

**Figure 8.** MST–SR visible–thermal image-fusion processing flow.

**Figure 9.** Example visible–thermal fusion result on a benchmark scene containing people under …

**Figure 10.** Self-acquired RGB–thermal fusion example with object/human signature partially …

**Figure 11.** Class distribution in the FLIR thermal dataset, showing dataset imbalance relevant …

**Figure 12.** YOLOv5 inference results on FLIR thermal test imagery. Bounding boxes show …

**Figure 13.** Synthetic forest scene used for AOS evaluation. Human targets are placed on the …

**Figure 14.** Representative individual images from the synthetic multi-view sequence used for …

**Figure 15.** Final AOS integral image produced by combining the synthetic multi-view sequence.

Original abstract

Reliable detection of humans beneath forest canopy remains a difficult remote-sensing challenge due to sparse, structured, and viewpoint-dependent occlusion. This paper presents a multimodal proof-of-concept pipeline that integrates three complementary approaches: (i) experimental evaluation of LiDAR returns through vegetation to assess the feasibility of active sensing, (ii) visible–thermal image fusion using a multi-scale transform and sparse-representation framework to enhance human saliency, and (iii) synthetic-aperture image formation via Airborne Optical Sectioning (AOS) to suppress canopy clutter. A YOLOv5 detector is fine-tuned on the Teledyne FLIR thermal dataset and evaluated on thermal and fused imagery. Results show that the tested terrestrial LiDAR configuration provides limited penetration for object-level detection, while visible–thermal fusion improves target visibility in low-contrast scenes and AOS enhances ground-plane detection in synthetic forest imagery. The fine-tuned YOLOv5 achieves a mean average precision of ~0.83 on the top three FLIR classes. These findings establish an initial baseline for UAV-deployable search-and-rescue and surveillance systems operating in forested environments, and motivate future work on dedicated forest datasets and real-time multimodal integration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper tests a multimodal detection pipeline on LiDAR, FLIR, and synthetic data but lacks real forest UAV experiments to back the UAV claims.

Here's the quick take: this paper combines LiDAR penetration testing, visible-thermal fusion via multi-scale transforms, Airborne Optical Sectioning for clutter suppression, and a fine-tuned YOLOv5 into one pipeline aimed at human detection under sparse forest canopy. It's framed as a proof-of-concept to set a baseline for UAV search-and-rescue and surveillance.

It does a solid job laying out the individual pieces. The LiDAR part shows limited penetration in their terrestrial setup for object-level work. Fusion helps make targets more visible in low-contrast cases. AOS improves ground-plane detection on the synthetic forest images. On the FLIR thermal data, the detector gets to about 0.83 mAP for the top classes. These are useful observations for anyone thinking about multimodal approaches in occluded settings.

The main limitation is the evaluation data. The LiDAR is terrestrial, the thermal images come from the standard FLIR dataset rather than forest scenes, and the AOS results are on synthetic imagery. There are no descriptions of real UAV flights or actual ground-based captures in real forest canopy with proper occlusion ground truth. This leaves the transfer to the target UAV forest scenario as an open question. The paper is experimental and applies known methods without new derivations, so the strength rests on how well these tests represent the real problem.

This would interest researchers in remote sensing and computer vision working on detection in natural environments, particularly those developing systems for search and rescue in wooded areas. It could spark ideas for better datasets or integration.

I think it deserves peer review. The combination is reasonable for the application, and getting referee comments on the experimental design and data choices would help strengthen it. The authors are clear that more real forest data is needed next.

Referee Report

1 major / 1 minor

Summary. The paper presents a multimodal proof-of-concept pipeline for object detection under sparse forest-canopy occlusion. It combines experimental evaluation of terrestrial LiDAR penetration through vegetation, visible-thermal image fusion via multi-scale transform and sparse representation, Airborne Optical Sectioning (AOS) for canopy clutter suppression, and fine-tuning of YOLOv5 on the Teledyne FLIR thermal dataset. Reported outcomes include limited LiDAR penetration for object-level detection, improved target visibility from fusion in low-contrast scenes, enhanced ground-plane detection with AOS on synthetic imagery, and mAP of approximately 0.83 on the top three FLIR classes. These results are positioned as an initial baseline for UAV-deployable search-and-rescue and surveillance systems in forested environments.

Significance. If the central claims hold, the work provides a preliminary experimental baseline combining active sensing, fusion, and synthetic aperture techniques for a challenging remote-sensing problem. The quantitative mAP result on FLIR data and qualitative observations on fusion and AOS offer concrete starting points that could motivate dedicated forest datasets and real-time integration, though the absence of end-to-end testing in the target regime limits immediate impact.

major comments (1)

[Abstract, Results] Abstract and results paragraph: the central claim that the pipeline 'establishes an initial baseline for UAV-deployable search-and-rescue and surveillance systems operating in forested environments' is not supported by the described experiments. All quantitative and qualitative results derive from a terrestrial LiDAR rig, the Teledyne FLIR thermal dataset, and synthetic AOS forest imagery; no UAV flights, real canopy-occluded ground-truth captures, or evaluation on actual sparse forest data are reported. This mismatch directly undermines the translation to the stated operating regime.

minor comments (1)

[Results] The manuscript should clarify the precise evaluation protocol for the reported mAP (e.g., train/test split details, number of runs, confidence thresholds) to allow reproducibility.
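To show why these protocol details matter, the sketch below computes average precision for a single class on toy data. It uses all-point interpolation of the precision-recall curve, which is one common convention and not necessarily the one behind the paper's reported figure. The function names, box format `(x1, y1, x2, y2)`, and the toy detections are assumptions made for illustration.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(detections, ground_truth, iou_thresh=0.5):
    """detections: list of (image_id, score, box); ground_truth: {image_id: [box, ...]}."""
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}
    tp = []
    # Visit detections from most to least confident.
    for img, _, box in sorted(detections, key=lambda d: -d[1]):
        best, best_j = 0.0, -1
        for j, gt_box in enumerate(ground_truth.get(img, [])):
            overlap = iou(box, gt_box)
            if overlap > best:
                best, best_j = overlap, j
        # A ground-truth box can be claimed by at most one detection.
        hit = best_j >= 0 and best >= iou_thresh and not matched[img][best_j]
        if hit:
            matched[img][best_j] = True
        tp.append(1.0 if hit else 0.0)
    tp = np.array(tp)
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1.0 - tp)
    recall = cum_tp / n_gt
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-12)
    # Make precision non-increasing in recall, then integrate the step curve.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    steps = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[steps + 1] - mrec[steps]) * mpre[steps + 1]))

# Toy data: two images, one ground-truth box each, three detections.
gt = {"a": [(0, 0, 10, 10)], "b": [(0, 0, 10, 10)]}
dets = [("a", 0.9, (0, 0, 10, 10)), ("a", 0.8, (50, 50, 60, 60)), ("b", 0.7, (1, 1, 10, 10))]
ap_loose = average_precision(dets, gt, iou_thresh=0.5)
ap_strict = average_precision(dets, gt, iou_thresh=0.9)
```

On this toy data the same detections score a lower AP under the stricter IoU threshold, because the slightly offset box in image "b" stops counting as a match. Mean average precision is the mean of such per-class values, so the IoU threshold, the interpolation rule, and the class subset all have to be stated before a figure like 0.83 can be compared across papers.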

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract overstates the direct applicability of our proof-of-concept results to UAV systems in real forested environments, as the experiments rely on terrestrial LiDAR, the FLIR dataset, and synthetic AOS imagery. We will revise the claims to more precisely reflect the component-level baselines provided and their role in motivating future UAV work.

Referee: [Abstract, Results] Abstract and results paragraph: the central claim that the pipeline 'establishes an initial baseline for UAV-deployable search-and-rescue and surveillance systems operating in forested environments' is not supported by the described experiments. All quantitative and qualitative results derive from a terrestrial LiDAR rig, the Teledyne FLIR thermal dataset, and synthetic AOS forest imagery; no UAV flights, real canopy-occluded ground-truth captures, or evaluation on actual sparse forest data are reported. This mismatch directly undermines the translation to the stated operating regime.

Authors: We acknowledge that the experiments do not include UAV flights or real-world sparse forest data with ground truth. The terrestrial LiDAR tests assess penetration feasibility relevant to canopy occlusion, the fusion experiments use the FLIR thermal dataset to demonstrate visibility improvements in low-contrast conditions, and AOS is evaluated on synthetic forest imagery to show clutter suppression. These provide targeted baselines for the core technical challenges. However, we agree the abstract wording implies a stronger translation to operational UAV systems than the current results support. We will revise the abstract and results section to state that the findings supply an initial multimodal baseline from these modalities and motivate dedicated UAV integration and forest datasets, removing the claim that the pipeline 'establishes' such a baseline for UAV-deployable systems.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely experimental evaluation with no derivations or self-referential reductions

The paper contains no equations, derivations, or claimed first-principles predictions. It reports direct experimental results from terrestrial LiDAR penetration tests, visible-thermal fusion on existing imagery, AOS on synthetic forest data, and fine-tuning/evaluation of YOLOv5 on the Teledyne FLIR dataset. All quantitative findings (limited LiDAR penetration, fusion improvements, mAP ~0.83) are obtained by applying standard methods to chosen inputs without any reduction of outputs back to fitted parameters or self-citations by construction. The work is self-contained empirical baseline reporting and does not invoke uniqueness theorems, ansatzes, or prior author results as load-bearing premises.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the work relies on standard assumptions of the cited datasets and methods being representative of the target domain.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

[1] A. Al-Kaff, D. Martin, F. Garcia, A. de la Escalera, and J. M. Armingol, “Survey of computer vision algorithms and applications for unmanned aerial vehicles,” Expert Systems with Applications, vol. 92, pp. 447–463, 2018.

[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788, 2016.

[3] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” arXiv preprint arXiv:1804.02767, 2018.

[4] X. Zhang, P. Ye, and G. Xiao, “VIFB: A Visible and Infrared Image Fusion Benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.

[5] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang, “Image Fusion with Convolutional Sparse Representation,” IEEE Signal Processing Letters, vol. 23, no. 12, pp. 1882–1886, 2016.

[6] M. Levoy and P. Hanrahan, “Light Field Rendering,” in Proceedings of SIGGRAPH, pp. 31–42, 1996.

[7] I. Kurmi, D. C. Schedl, and O. Bimber, “Airborne Optical Sectioning for Object Detection in Cluttered Environments,” ISPRS Journal of Photogrammetry and Remote Sensing, 2020.

[8] A. Sharma, N. Jain, and M. Kothari, “Lightweight Multi-Drone Detection and 3D-Localization via YOLO,” arXiv preprint, 2021.

[9] N. Jain, A. A. Shah, H. Bollamreddi, and M. Kothari, “Development of a Low Cost Autonomous Ground Vehicle,” in 2022 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), pp. 154–160, 2022.

[10] H. Sinha, J. Patrikar, E. G. Dhekane, G. Pandey, and M. Kothari, “Convolutional Neural Network Based Sensors for Mobile Robot Relocalization,” in 2018 23rd International Conference on Methods & Models in Automation & Robotics (MMAR), pp. 774–779, 2018.
