CAMF-Det: Closure-Aware Multimodal Fusion for LiDAR-Camera 3D Object Detection on UAV Platforms

Xian Li; Yanfeng Gu; Yanze Jiang

arxiv: 2606.09143 · v1 · pith:U2DLFSI2new · submitted 2026-06-08 · 💻 cs.CV

Pith reviewed 2026-06-27 17:08 UTC · model grok-4.3

keywords: UAV 3D object detection · multimodal fusion · occlusion modeling · LiDAR-camera fusion · closure-aware detection · tree canopy occlusion · physics-inspired modeling · top-down scene detection

The pith

CAMF-Det improves UAV 3D object detection by modeling dual-modal occlusion from tree canopies with physics-inspired priors and embedding them throughout the pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that existing multimodal fusion methods fall short for UAV platforms because they ignore the spatially varying and modality-dependent occlusion caused by tree canopies in top-down views. It proposes CAMF-Det as a framework that first builds offline occlusion intensity ground truth for LiDAR and camera using a Beer-Lambert-inspired formulation plus building-mask correction, then trains an online prediction network from those maps, and finally injects both ground-truth and predicted values into data augmentation, feature encoding, multimodal fusion, and the detection head. The authors report that this produces the highest mAP_BEV scores across easy, moderate, and hard levels on two self-built UAV datasets, with the largest gains on hard cases. A sympathetic reader would care because ground-vehicle multimodal detectors do not transfer directly to UAV scenes dominated by canopy occlusion, and the work offers a concrete way to adapt fusion without requiring multi-frame inputs.

Core claim

CAMF-Det derives dual-modal occlusion intensity through physics-inspired modeling and embeds it as a prior throughout the detection pipeline. First, a dual-modal closure modeling module explicitly constructs occlusion intensity ground truth for both modalities offline via a Beer-Lambert-inspired formulation and building-mask correction. Second, using these ground-truth maps as supervision, a dual-modal prediction network converts the offline modeling results into online occlusion intensity predictions under single-frame inference. Third, both ground-truth and predicted occlusion intensity are injected into data augmentation, feature encoding, multimodal fusion, and the detection head, enabling adaptive detection under spatially varying and modality-dependent information degradation.

What carries the argument

Dual-modal closure modeling module that constructs occlusion intensity ground truth offline via a Beer-Lambert-inspired formulation and building-mask correction, then supervises an online prediction network whose outputs are injected at multiple pipeline stages.
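
As a rough illustration of the Beer-Lambert idea behind the closure modeling, the sketch below converts an accumulated canopy path length into an occlusion intensity as one minus the transmittance. The function name, the `path_length_m` and `extinction_coeff` parameters, and the sample values are assumptions made for illustration; the paper's exact formulation and its building-mask correction are not reproduced here.

```python
import math


def occlusion_intensity(path_length_m: float, extinction_coeff: float) -> float:
    """Beer-Lambert-style occlusion: 1 - exp(-k * L).

    Both parameter names are hypothetical; for non-negative inputs the
    result lies in [0, 1).
    """
    if path_length_m < 0 or extinction_coeff < 0:
        raise ValueError("path length and extinction coefficient must be non-negative")
    transmittance = math.exp(-extinction_coeff * path_length_m)
    return 1.0 - transmittance


# No canopy along the ray gives zero occlusion; longer canopy paths give more.
for length in (0.0, 1.0, 3.0):
    print(length, round(occlusion_intensity(length, extinction_coeff=0.5), 3))
```

The exponential form keeps the value bounded and monotone in canopy thickness, which is the property a per-pixel or per-grid-cell prior of this kind would rely on.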

If this is right

The framework enables adaptive detection by injecting occlusion priors at data augmentation, feature encoding, multimodal fusion, and detection head stages.
Performance gains are observed across all difficulty levels and are largest on hard examples dominated by occlusion.
The method supports single-frame inference once the online prediction network has been trained on offline ground-truth maps.
Both the ground-truth and the predicted occlusion intensity maps contribute to the final detection accuracy when used as priors.
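
One way to picture what injecting occlusion priors into the fusion stage could mean is a per-cell blend that down-weights whichever modality is more occluded. This is a minimal sketch under that assumption only; `fuse_cell` and its weighting rule are hypothetical and are not the paper's OPF module.

```python
def fuse_cell(lidar_feat, cam_feat, occ_lidar, occ_cam, eps=1e-6):
    """Blend two per-cell feature vectors using visibility weights.

    occ_lidar and occ_cam are occlusion intensities in [0, 1]; the
    visibility weight of each modality is (1 - occlusion). This rule is
    an illustrative assumption, not the published method.
    """
    w_lidar = 1.0 - occ_lidar
    w_cam = 1.0 - occ_cam
    total = w_lidar + w_cam + eps  # eps avoids division by zero when both are fully occluded
    return [(w_lidar * l + w_cam * c) / total for l, c in zip(lidar_feat, cam_feat)]


# LiDAR is lightly occluded and the camera heavily occluded in this cell,
# so the fused vector leans toward the LiDAR features.
fused = fuse_cell([1.0, 0.0], [0.0, 1.0], occ_lidar=0.2, occ_cam=0.8)
print(fused)
```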

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same offline modeling step could be applied to other elevated sensor platforms that encounter vegetation occlusion.
Replacing the building-mask correction with a learned semantic segmentation step might extend the approach to scenes without reliable building maps.
Testing whether the predicted occlusion intensities correlate with actual per-modality feature degradation on held-out UAV flights would provide an independent check on the modeling fidelity.
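
The correlation check suggested in the last item could be run with a plain Pearson coefficient between predicted occlusion intensity and some measured degradation proxy. The sketch below assumes such a proxy exists (for example, a normalized drop in LiDAR point density); the sample numbers are made up for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-region values, for illustration only.
predicted_occlusion = [0.05, 0.30, 0.55, 0.80]
point_density_drop = [0.10, 0.25, 0.60, 0.75]
r = pearson(predicted_occlusion, point_density_drop)
print(round(r, 3))
```

A coefficient near 1 on held-out flights would support the modeling fidelity; a weak or negative one would point to a mismatch between the modeled and the real degradation.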

Load-bearing premise

The offline dual-modal closure modeling that constructs occlusion intensity ground truth via a Beer-Lambert-inspired formulation plus building-mask correction accurately represents real spatially varying and modality-dependent occlusion in UAV top-down scenes.

What would settle it

An ablation in which the modeled occlusion intensity maps are replaced by random or constant maps while all other components are kept fixed. If the reported mAP_BEV gains persist under that substitution, the specific occlusion priors are not responsible for the performance improvements; if the gains disappear, the priors are doing the work.
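
The control condition for such an ablation needs stand-in maps with the same shape as the modeled ones. The sketch below builds a seeded random map and a constant map set to the modeled map's mean; the helper names and the choice of the mean as the constant are assumptions for illustration.

```python
import random


def constant_map(height, width, value):
    """Map in which every cell holds the same occlusion value."""
    return [[value] * width for _ in range(height)]


def random_map(height, width, seed=0):
    """Map of uniform random values in [0, 1), reproducible via the seed."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(width)] for _ in range(height)]


def control_maps(modeled_map, seed=0):
    """Build random and constant controls matching the modeled map's shape."""
    height, width = len(modeled_map), len(modeled_map[0])
    mean_value = sum(sum(row) for row in modeled_map) / (height * width)
    return {
        "random": random_map(height, width, seed),
        "constant": constant_map(height, width, mean_value),
    }


controls = control_maps([[0.1, 0.9], [0.5, 0.5]])
print(controls["constant"])
```

Using the modeled map's mean for the constant control keeps the overall occlusion level comparable, so any remaining gap can be attributed to spatial structure rather than to average intensity.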

Figures

Figures reproduced from arXiv: 2606.09143 by Xian Li, Yanfeng Gu, Yanze Jiang.

**Figure 1.** Motivation of CAMF-Det. (a) UAV top-down observations suffer from spatially varying and modality-dependent degradation under canopy occlusion. (b) Existing multimodal methods underperform the LiDAR-only baseline, while CAMF-Det achieves the best performance across all occlusion levels.

… penetrate canopy gaps, yielding sparse but still usable point clouds, whereas canopy directly obscures the color and texture …

**Figure 2.** Overall architecture of the proposed CAMF-Det. (a) DPM constructs dual-modal occlusion intensity ground truth offline from multi-strip dense point clouds. (b) DCPNet predicts occlusion intensity online by reusing intermediate features from the image and voxel encoders. (c) OPF embeds occlusion priors into the detection pipeline. Green and orange arrows distinguish training-only and training/inference guidance, …

**Figure 3.** Architecture of the DPM module. The Beer-Lambert-inspired closure modeling principle (left) is applied to image-domain pixel frustums (middle) and LiDAR-domain BEV ground grids (right), with building-mask correction, to produce dual-modal occlusion intensity ground truth.

… into Eq. (3) yields $C_e(F_{u,v})$ for each pixel, and traversing all valid pixels produces the image-domain closure map $C_e^{img} \in \mathbb{R}^{H \times W}$. …

**Figure 4.** Architecture of the DCPNet. (a) Image-domain sub-network with multi-scale vegetation feature extraction and point cloud projection auxiliary encoding. (b) LiDAR-domain sub-network with terrain-adaptive perspective projection and geometric statistical feature extraction.

The features $F_{fpn}$, $F_{veg}$, and $F_{pc}$ are concatenated and fused through two convolutional layers. An occlusion decoder progressively reduces …

**Figure 5.** Architecture of the OPF strategy, which embeds occlusion priors into data augmentation, feature encoding, multimodal fusion, and detection head stages. Sub-modules (a) and (b) operate during training only, while (c) and (d) are active during both training and inference.

3.3. Occlusion Prior-Guided Fusion Strategy. OPF employs four dedicated sub-modules to embed the occlusion intensity ground truth of DPM a…

**Figure 6.** Overview of the two UAV-based multimodal datasets. (a) Data acquisition platform. (b) Collection sites. (c) Large-scene point clouds and target categories. (d) Examples of image, point cloud, and LiDAR-to-image projection data illustrating cross-modal spatial alignment in representative scenes.

Legend labels recovered from the figure: canopy, vehicle, art-target, cam-target; exposed (0 ≤ c < 0.1), lightly (0.1 ≤ c < 0.7), heavily (0.7 ≤ c).
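
The legend text above gives three occlusion levels by closure value c: exposed (0 ≤ c < 0.1), lightly occluded (0.1 ≤ c < 0.7), and heavily occluded (c ≥ 0.7). A small sketch of that binning follows; the function name and the sample closure values are illustrative assumptions, only the thresholds come from the page.

```python
from collections import Counter


def occlusion_level(c: float) -> str:
    """Map a closure value c to the occlusion level given in the figure legend."""
    if c < 0:
        raise ValueError("closure value must be non-negative")
    if c < 0.1:
        return "exposed"
    if c < 0.7:
        return "lightly occluded"
    return "heavily occluded"


# Hypothetical closure values for six target instances.
sample_closures = [0.0, 0.05, 0.1, 0.69, 0.7, 0.95]
counts = Counter(occlusion_level(c) for c in sample_closures)
print(dict(counts))
```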

**Figure 7.** Distribution of target instances across exposed, lightly occluded, and heavily occluded levels on the two datasets. (a) SI3D-DI. (b) SI3D-DII.

4.2. Experimental Details. Data preprocessing and network configuration. The point cloud contains 3D coordinates and reflectance intensity. The voxel size is set to 0.1 m. The point cloud detection ranges (XYZ axes) are set to [−64, −16, 0, 64, 16, 16] (m) for SI3D-D…
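
The quoted voxel size and detection range imply a fixed grid resolution. The sketch below works that arithmetic out, assuming the range is ordered as (x_min, y_min, z_min, x_max, y_max, z_max) and that the 0.1 m voxel size applies to all three axes; both are assumptions, and the helper name is hypothetical.

```python
def grid_shape(point_range, voxel_size):
    """Number of voxels along x, y, z for a range given as (min_xyz, max_xyz)."""
    x_min, y_min, z_min, x_max, y_max, z_max = point_range
    extents = (x_max - x_min, y_max - y_min, z_max - z_min)
    # round() guards against floating-point error in the division by 0.1
    return tuple(round(extent / voxel_size) for extent in extents)


shape = grid_shape([-64, -16, 0, 64, 16, 16], 0.1)
print(shape)                            # (1280, 320, 160)
print(shape[0] * shape[1] * shape[2])   # 65536000 dense voxel positions
```

A dense grid of that size is one reason voxel-based detectors typically rely on sparse representations that only store occupied voxels.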

**Figure 8.** Visual comparison of detection results from BEVFusion, TG-ADet, and CAMF-Det on the SI3D-DI dataset. For each scene, the upper and lower rows show the image-domain and LiDAR-domain results, respectively. Circles highlight false positive and false negative instances.

**Figure 9.** Representative detection results of CAMF-Det across six scenes with varying occlusion conditions on the SI3D-DI dataset. For each scene, the first row shows image-domain results, and the second and third rows show LiDAR-domain results from different viewpoints.

**Figure 10.** Visual comparison of occlusion intensity predictions from the ablation networks, DCPNet, and the corresponding ground truth for both modalities on the test sets. The ablation network removes LiDAR projection features for the image modality and geometric statistical features for the LiDAR modality.

**Figure 11.** Overall framework ablation on SI3D-DI, showing detection performance as the image branch, DCPNet, and OPF are progressively introduced to TG-ADet.

… 0.111 and 0.146 on SI3D-DI and SI3D-DII, respectively. In the LiDAR modality, the MAE reaches 0.157 and 0.167 on the two datasets. The stable accuracy across both modalities indicates that DCPNet can reliably fit the dual-modal occlusion intensity ground truth …
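
The MAE figures quoted in the excerpt measure how closely predicted occlusion maps match the offline ground truth. A minimal sketch of that metric on two small maps follows; the function name and the toy maps are assumptions for illustration, and any masking of invalid pixels or cells the paper may apply is not modeled.

```python
def mean_absolute_error(pred_map, target_map):
    """Mean absolute difference between two equally shaped 2D maps."""
    pred = [v for row in pred_map for v in row]
    target = [v for row in target_map for v in row]
    if len(pred) != len(target) or not pred:
        raise ValueError("maps must be non-empty and have the same number of cells")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Toy 2x2 occlusion maps with values in [0, 1].
predicted = [[0.0, 1.0], [0.5, 0.5]]
ground_truth = [[0.5, 0.5], [0.5, 0.5]]
print(mean_absolute_error(predicted, ground_truth))
```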

Original abstract

Multimodal 3D object detection based on LiDAR and cameras has demonstrated excellent performance in ground-vehicle scenarios, but has not been explored for Unmanned Aerial Vehicle (UAV) platforms. In UAV top-down scenes, frequent ground-object occlusion dominated by tree canopies causes spatially varying and modality-dependent information degradation. Existing multimodal fusion frameworks neither explicitly model such ground-object occlusion nor embed occlusion awareness into the detection pipeline, limiting their performance in occluded UAV scenes. To address these challenges, we propose CAMF-Det, a closure-aware multimodal fusion framework for LiDAR-camera 3D object detection on UAV platforms, which derives dual-modal occlusion intensity through physics-inspired modeling and embeds them as priors throughout the detection pipeline. First, a dual-modal closure modeling module explicitly constructs occlusion intensity ground truth for both modalities offline via a Beer-Lambert-inspired formulation and building-mask correction. Second, using these ground-truth maps as supervision, a dual-modal prediction network converts the offline modeling results into online occlusion intensity predictions under single-frame inference. Third, both ground-truth and predicted occlusion intensity are injected into data augmentation, feature encoding, multimodal fusion, and detection head, enabling adaptive detection under spatially varying and modality-dependent information degradation. Experiments on two self-built UAV-based multimodal datasets, SI3D-DI and SI3D-DII, demonstrate that CAMF-Det achieves the best performance across all difficulty levels, with hard-level mAP$_{\mathrm{BEV}}$ improvements of 9.43% and 4.88% over the best competing methods, respectively. These results confirm the effectiveness of explicit occlusion prior modeling and exploitation for robust multimodal 3D detection in UAV scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CAMF-Det brings explicit dual-modal occlusion priors into UAV LiDAR-camera detection via Beer-Lambert modeling, but the gains sit on private datasets and untested physical assumptions.

The paper's main move is to build offline occlusion intensity maps for both LiDAR and camera using a Beer-Lambert transmittance model plus building-mask correction, then train an online predictor from those maps and feed the priors into augmentation, feature encoding, fusion, and the detection head. This targets the specific top-down tree canopy occlusion that ground-vehicle fusion papers ignore.

What is actually new is the end-to-end injection of modality-specific closure priors tailored to UAV scenes; the cited ground-vehicle work does not do this. The supervision stays external to the detection loss, which avoids circularity.

The reported hard-level mAP_BEV lifts (9.43% and 4.88%) are the headline numbers, but they come from two self-built datasets with no public release, no error bars, and no ablation that isolates the occlusion component. The stress-test concern lands: if the assumed uniform canopy density and the building-mask correction do not match real spatially varying, modality-dependent degradation, the supervision signal is systematically off and the downstream gains rest on an incorrect prior.

The modeling itself is simple and the pipeline integration is thorough, but the empirical support is thin without those checks. This is a narrow but practical contribution for aerial robotics and remote sensing groups working on occluded top-down detection. It is coherent on its own terms and deserves a serious referee to examine the full experiments and the fidelity of the physical maps.

Referee Report

3 major / 1 minor

Summary. The paper proposes CAMF-Det, a closure-aware multimodal fusion framework for LiDAR-camera 3D object detection on UAV platforms. It derives dual-modal occlusion intensity ground truth offline via a Beer-Lambert-inspired transmittance formulation with building-mask correction, trains an online prediction network under this supervision, and injects both ground-truth and predicted occlusion maps as priors into data augmentation, feature encoding, multimodal fusion, and the detection head to handle spatially varying, modality-dependent degradation from tree canopies. Experiments on two self-built datasets (SI3D-DI and SI3D-DII) report state-of-the-art results, including hard-level mAP_BEV gains of 9.43% and 4.88% over the best competing methods.

Significance. If the empirical results hold, the work would be significant for extending multimodal 3D detection to UAV top-down scenes, where occlusion is a dominant issue not addressed by ground-vehicle frameworks. The explicit use of physics-inspired modeling to generate supervision signals for occlusion awareness is a methodological strength that could inspire similar prior-injection approaches in other degraded sensing scenarios.

major comments (3)

[Abstract and Experiments] Abstract and Experiments section: the headline hard-level mAP_BEV gains of 9.43% and 4.88% are presented without baseline implementation details, error bars, multiple-run statistics, or ablation studies isolating the contribution of the dual-modal closure modeling and prior injection; this makes it impossible to verify whether the central performance claim is attributable to the proposed components.
[Dual-modal closure modeling module] Dual-modal closure modeling module: the offline construction of occlusion intensity ground truth via Beer-Lambert-inspired formulation plus building-mask correction is treated as faithful supervision for the online network and downstream injection, yet no validation (qualitative comparison to real sensor degradation, alternative modeling baselines, or sensitivity analysis on extinction coefficients) is supplied to confirm it captures modality-dependent, spatially varying occlusion in UAV canopy scenes.
[Experiments] Experiments section: the two datasets are self-built with no public release or access stated, and no cross-validation or external dataset results are reported; this directly undermines reproducibility of the load-bearing empirical claim.
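
The error bars and multiple-run statistics requested in the first comment usually amount to reporting a mean and sample standard deviation over independently seeded runs. The sketch below shows that summary; the scores are placeholders invented for illustration and are not results from the paper.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder metric values from three hypothetical seeds.
placeholder_scores = [60.0, 62.0, 61.0]
mean_score, std_score = summarize_runs(placeholder_scores)
print(f"{mean_score:.2f} ± {std_score:.2f}")
```

Reporting the spread alongside the mean lets a reader judge whether a gap between two methods exceeds the run-to-run variation.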

minor comments (1)

[Abstract] Notation for mAP_BEV and difficulty levels should be defined at first use and cross-referenced to standard 3D detection metrics.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, committing to revisions that strengthen clarity, validation, and reproducibility where possible without misrepresenting the current work.

Referee: [Abstract and Experiments] Abstract and Experiments section: the headline hard-level mAP_BEV gains of 9.43% and 4.88% are presented without baseline implementation details, error bars, multiple-run statistics, or ablation studies isolating the contribution of the dual-modal closure modeling and prior injection; this makes it impossible to verify whether the central performance claim is attributable to the proposed components.

Authors: We agree that additional implementation details and isolation of contributions would improve verifiability. In the revised manuscript, we will expand the Experiments section with full baseline implementation descriptions (including any UAV-specific adaptations), and we will augment the existing ablation studies to explicitly quantify the isolated effects of the dual-modal closure modeling module and the prior injection strategy. For error bars and multiple-run statistics, we will report averaged results over multiple random seeds where computational resources permit.

revision: partial

Referee: [Dual-modal closure modeling module] Dual-modal closure modeling module: the offline construction of occlusion intensity ground truth via Beer-Lambert-inspired formulation plus building-mask correction is treated as faithful supervision for the online network and downstream injection, yet no validation (qualitative comparison to real sensor degradation, alternative modeling baselines, or sensitivity analysis on extinction coefficients) is supplied to confirm it captures modality-dependent, spatially varying occlusion in UAV canopy scenes.

Authors: We acknowledge that explicit validation of the offline modeling would strengthen the claim. The Beer-Lambert formulation is a physically grounded model for transmittance, with the building-mask correction tailored to UAV canopy scenes. In the revision, we will add qualitative side-by-side comparisons of the derived occlusion intensity maps against observed LiDAR point density drops and camera visibility degradation in sample scenes. We will also include sensitivity analysis on extinction coefficients and comparisons to alternative modeling baselines.

revision: yes

Referee: [Experiments] Experiments section: the two datasets are self-built with no public release or access stated, and no cross-validation or external dataset results are reported; this directly undermines reproducibility of the load-bearing empirical claim.

Authors: We understand the reproducibility concern. The SI3D-DI and SI3D-DII datasets were purpose-built for UAV top-down canopy occlusion scenarios. In the revised manuscript, we will provide substantially more detail on acquisition setup, sensor parameters, flight trajectories, and annotation methodology. We will explicitly discuss the difficulty of cross-validation against ground-vehicle datasets (due to fundamentally different occlusion patterns) and will explore options for releasing a subset of the data or providing generation scripts where feasible.

revision: partial
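
The sensitivity analysis on extinction coefficients mentioned in the second response could start from a simple sweep: hold the canopy path lengths fixed and see how the mean modeled occlusion shifts with the coefficient. The sketch below assumes the generic form 1 - exp(-k L); the path lengths, coefficient values, and function names are hypothetical.

```python
import math


def occlusion(path_length_m, k):
    """Generic Beer-Lambert-style occlusion, 1 - exp(-k * L)."""
    return 1.0 - math.exp(-k * path_length_m)


def sensitivity(path_lengths, k_values):
    """Mean occlusion over the given path lengths for each coefficient k."""
    return {
        k: sum(occlusion(length, k) for length in path_lengths) / len(path_lengths)
        for k in k_values
    }


# Hypothetical canopy path lengths (m) and extinction coefficients.
table = sensitivity([0.5, 1.0, 2.0, 4.0], [0.25, 0.5, 1.0])
for k, mean_occ in table.items():
    print(f"k={k}: mean occlusion {mean_occ:.3f}")
```

If detection accuracy stays stable while the mean occlusion moves substantially across plausible coefficients, the pipeline is robust to that free parameter; a sharp dependence would make the coefficient choice a load-bearing decision.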

Circularity Check

0 steps flagged

No circularity: external physics-based supervision and standard supervised prediction

full rationale

The paper constructs occlusion intensity GT offline using a Beer-Lambert-inspired formulation plus building-mask correction (external physical model, not derived from detection outputs or fitted to task results). The dual-modal prediction network is then trained to predict these GT maps from single-frame inputs; this is ordinary supervised learning, not a prediction that reduces to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes smuggled via prior work appear in the derivation chain. Performance claims rest on empirical results on held-out test sets rather than any definitional equivalence. The chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the applicability of the Beer-Lambert law to model attenuation through canopies for both sensors and on the correctness of the building-mask correction step for generating supervision.

free parameters (1)

extinction coefficients
Parameters appearing in the Beer-Lambert-inspired formulation used to compute offline occlusion intensity maps for both modalities.

axioms (1)

domain assumption: the Beer-Lambert law applies to light and laser attenuation through tree canopies in top-down UAV views
Invoked to derive dual-modal occlusion intensity ground truth offline via the physics-inspired formulation.

Reference graph

Works this paper leans on

46 extracted references

[1]

L. Wang, X. Zhang, Z. Song, J. Bi, G. Zhang, H. Wei, L. Tang, L. Yang, J. Li, C. Jia, Multi-modal 3D object detection in autonomous driving: A survey and taxonomy, IEEE Transactions on Intelligent Vehicles, 8 (2023) 3781-3798

2023
[2]

Y . Wang, Q. Mao, H. Zhu, J. Deng, Y . Zhang, J. Ji, H. Li, Y . Zhang, Multi- modal 3d object detection in autonomous driving: a survey, International Journal of Computer Vision, 131 (2023) 2122-2152

2023
[3]

Z. Song, L. Liu, F. Jia, Y . Luo, C. Jia, G. Zhang, L. Yang, L. Wang, Robustness-aware 3d object detection in autonomous driving: A review and outlook, IEEE Transactions on Intelligent Transportation Systems, 25 (2024) 15407-15436

2024
[4]

V ora, A.H

S. V ora, A.H. Lang, B. Helou, O. Beijbom, Pointpainting: Sequential fu- sion for 3d object detection, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4604-4612

2020
[5]

C. Wang, C. Ma, M. Zhu, X. Yang, Pointaugmenting: Cross-modal augmentation for 3d object detection, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11794- 11803

2021
[6]

Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D.L. Rus, S. Han, Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representa- tion, in: 2023 IEEE international conference on robotics and automation (ICRA), IEEE, 2023, pp. 2774-2781. 14

2023
[7]

Z. Ning, Z. Liu, X. Gao, Y . Zuo, J. Yang, Y . Fang, W. Liu, CMF-IoU: Multi-Stage Cross-Modal Fusion 3D Object Detection with IoU Joint Pre- diction, IEEE Transactions on Circuits and Systems for Video Technology, (2025)

2025
[8]

B. Ding, J. Xie, J. Nie, J. Cao, SSLFusion: Scale and Space Aligned La- tent Fusion Model for Multimodal 3D Object Detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2025, pp. 2735-2743

2025
[9]

X. Bai, Z. Hu, X. Zhu, Q. Huang, Y . Chen, H. Fu, C.-L. Tai, Transfusion: Robust lidar-camera fusion for 3d object detection with transformers, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1090-1099

2022
[10]

Y . Tian, F. Lin, Y . Li, T. Zhang, Q. Zhang, X. Fu, J. Huang, X. Dai, Y . Wang, C. Tian, UA Vs meet LLMs: Overviews and perspectives towards agentic low-altitude mobility, Information Fusion, 122 (2025) 103158

2025
[11]

J. Su, X. Zhu, S. Li, W.-H. Chen, AI meets UA Vs: A survey on AI empow- ered UA V perception systems for precision agriculture, Neurocomputing, 518 (2023) 242-270

2023
[12]

H. Ye, R. Sunderraman, S. Ji, Uav3d: A large-scale 3d perception bench- mark for unmanned aerial vehicles, Advances in Neural Information Processing Systems, 37 (2024) 55425-55442

2024
[13]

Y . Hu, Y . Lu, R. Xu, W. Xie, S. Chen, Y . Wang, Collaboration helps camera overtake lidar in 3d detection, in: Proc. IEEE/CVF CVPR, 2023, pp. 9243-9252

2023
[14]

Meier, L

J. Meier, L. Scalerandi, O. Dhaouadi, J. Kaiser, N. Araslanov, D. Cremers, CARLA Drone: monocular 3d object detection from a different perspec- tive, in: DAGM German Conference on Pattern Recognition, Springer, 2024, pp. 137-152

2024
[15]

Hayton, T

J.N. Hayton, T. Barros, C. Premebida, M.J. Coombes, U.J. Nunes, CNN- based human detection using a 3D LiDAR onboard a UA V , in: 2020 IEEE international conference on autonomous robot systems and competitions (ICARSC), IEEE, 2020, pp. 312-318

2020
[16]

Cherif, H

B. Cherif, H. Ghazzai, A. Alsharoa, Lidar from the sky: Uav integration and fusion techniques for advanced traffic monitoring, IEEE Systems Journal, 18 (2024) 1639-1650

2024
[17]

Cherif, H

B. Cherif, H. Ghazzai, A. Alsharoa, H. Besbes, Y . Massoud, Aerial LiDAR-based 3D object detection and tracking for traffic monitoring, in: 2023 IEEE International symposium on circuits and systems (ISCAS), IEEE, 2023, pp. 1-5

2023
[18]

Y . Zhou, O. Tuzel, V oxelnet: End-to-end learning for point cloud based 3d object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4490-4499

2018
[19]

A.H. Lang, S. V ora, H. Caesar, L. Zhou, J. Yang, O. Beijbom, Pointpillars: Fast encoders for object detection from point clouds, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12697-12705

2019
[20]

Y . Yan, Y . Mao, B. Li, SECOND: Sparsely embedded convolutional detection, Sensors, 18 (10) (2018) 3337

2018
[21]

S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang, H. Li, PV-RCNN++: Point-voxel feature set abstraction with local vector rep- resentation for 3D object detection, International Journal of Computer Vision, 131 (2023) 531-551

2023
[22]

Zhang, C

G. Zhang, C. Junnan, G. Gao, J. Li, X. Hu, Hednet: A hierarchical encoder-decoder network for 3d object detection in point clouds, Ad- vances in Neural Information Processing Systems, 36 (2023) 53076- 53089

2023
[23]

Y . Chen, J. Liu, X. Zhang, X. Qi, J. Jia, V oxelnext: Fully sparse voxelnet for 3d object detection and tracking, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 21674- 21683

2023
[24]

Zhang, J

G. Zhang, J. Chen, G. Gao, J. Li, S. Liu, X. Hu, Safdnet: A simple and effective network for fully sparse 3d object detection, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 14477-14486

2024
[25]

L. Fan, F. Wang, N. Wang, Z. Zhang, Fsd v2: Improving fully sparse 3d object detection with virtual voxels, IEEE Transactions on Pattern Analysis and Machine Intelligence, 47 (2024) 1279-1292

2024
[26]

L. Fan, Z. Pang, T. Zhang, Y .-X. Wang, H. Zhao, F. Wang, N. Wang, Z. Zhang, Embracing single stride 3d object detector with sparse transformer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 8458-8468

2022
[27]

Z. Liu, X. Yang, H. Tang, S. Yang, S. Han, Flatformer: Flattened window attention for efficient point cloud transformer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 1200-1211

2023
[28]

H. Wang, C. Shi, S. Shi, M. Lei, S. Wang, D. He, B. Schiele, L. Wang, Dsvt: Dynamic sparse voxel transformer with rotated sets, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2023, pp. 13520-13529

2023
[29]

Zhang, L

G. Zhang, L. Fan, C. He, Z. Lei, Z. Zhang, L. Zhang, V oxel mamba: Group-free state space models for point cloud based 3d object detection, Advances in Neural Information Processing Systems, 37 (2024) 81489- 81509

2024
[30]

J. Yin, J. Shen, R. Chen, W. Li, R. Yang, P. Frossard, W. Wang, Is-fusion: Instance-scene collaborative fusion for multimodal 3d object detection, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 14905-14915

2024
[31]

Zhang, C

G. Zhang, C. He, L. Chen, L. Zhang, BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2026, pp. 12448-12456

2026
[32]

Y . Li, Y . Yang, Z. Lei, CoreNet: Conflict Resolution Network for point- pixel misalignment and sub-task suppression of 3D LiDAR-camera object detection, Information Fusion, 118 (2025) 102896

2025
[33]

B. Wang, C. Xia, X. Gao, Y . Yang, B. Ge, K.-C. Li, Y . Zhang, PV-MM3D: Point-voxel parallel dual-stream framework with dual-attention region adaptive fusion for multimodal 3D object detection, Information Fusion, (2025) 103983

2025
[34]

H. Wu, C. Wen, S. Shi, X. Li, C. Wang, Virtual sparse convolution for multimodal 3d object detection, in: Proceedings of the IEEE/CVF CVPR, 2023, pp. 21653-21662

2023
[35]

Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, F. Zhao, Deformable feature aggregation for dynamic multi-modal 3D object detection, in: European conference on computer vision, Springer, 2022, pp. 628-644

2022
[36]

Z. Song, G. Zhang, J. Xie, L. Liu, C. Jia, S. Xu, Z. Wang, V oxelNext- Fusion: A Simple, Unified, and Effective V oxel Fusion Framework for Multimodal 3-D Object Detection, IEEE Transactions on Geoscience and Remote Sensing, 61 (2023) 1-12

2023
[37]

Zhang, L

H. Zhang, L. Liang, P. Zeng, X. Song, Z. Wang, SparseLIF: High- performance sparse LiDAR-camera fusion for 3D object detection, in: European conference on computer vision, Springer, 2024, pp. 109-128

2024
[38]

Y . Hu, S. Fang, W. Xie, S. Chen, Aerial monocular 3d object detection, IEEE Robotics and Automation Letters, 8 (2023) 1959-1966

2023
[39]

P. Tian, Z. Wang, P. Cheng, Y . Wang, Z. Wang, L. Zhao, M. Yan, X. Yang, X. Sun, Ucdnet: Multi-uav collaborative 3-d object detection network by reliable feature mapping, IEEE Transactions on Geoscience and Remote Sensing, 63 (2024) 1-16

2024
[40]

Y . Wang, Z. Wang, P. Cheng, P. Tian, Z. Yuan, J. Tian, W. Wang, L. Zhao, Uvcpnet: A uav-vehicle collaborative perception network for 3d object detection, IEEE Transactions on Geoscience and Remote Sensing, (2025)

2025
[41]

Shrout, O

O. Shrout, O. Nizan, Y . Ben-Shabat, A. Tal, WiSAR3D-Aerial LiDAR Dataset for 3D Object Detection, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2026, pp. 6580-6589

2026
[42]

Liang, L

A. Liang, L. Kong, D. Lu, Y . Liu, J. Fang, H. Zhao, W.T. Ooi, Perspective- invariant 3D object detection, in: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, 2025, pp. 27725-27738

2025
[43]

Jiang, X

Y . Jiang, X. Li, Y . Gu, TG-ADet: Terrain-Guided Network for 3D Object Detection in ALS Point Clouds, IEEE Transactions on Geoscience and Remote Sensing, (2025)

2025
[44]

J. Wang, X. Cao, J. Zhong, Y . Zhang, Z. Han, H. Yu, C. Zhang, L. He, S. Xu, J. Wang, Griffin: Aerial-ground cooperative detection and tracking dataset and benchmark, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2026, pp. 9867-9875

2026
[45]

Monsi, T

M. Monsi, T. Saeki, On the factor light in plant communities and its importance for matter production, Annals of botany, 95 (2005) 549-567

2005
[46]

Korhonen, I

L. Korhonen, I. Korpela, J. Heiskanen, M. Maltamo, Airborne discrete- return LIDAR data in the estimation of vertical canopy cover, angular canopy closure and leaf area index, Remote Sensing of Environment, 115 (2011) 1065-1080. 15

2011

