
arxiv: 2212.11538 · v2 · submitted 2022-12-22 · 💻 cs.CV

SHLE: Devices Tracking and Depth Filtering for Stereo-based Height Limit Estimation

Pith reviewed 2026-05-24 10:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords height limit estimation · stereo vision · device tracking · depth filtering · over-height vehicle · Disparity Height dataset · computer vision pipeline

The pith

A stereo pipeline tracks height-limit devices across frames, then filters depth measurements over time to estimate their clearance, with a reported average error under 10 cm at 70 m range.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SHLE as a two-stage stereo vision system that first locates and follows height-limiting objects such as bridges or signs across video frames, then repeatedly samples depth at those locations, extracts stable points, and applies temporal filtering to arrive at a height value. This addresses frequent over-height vehicle collisions by giving drivers advance warning inside ordinary cars. The authors support the claim by releasing a new dataset of stereo pairs with disparity maps and annotated heights, then showing that the full pipeline beats prior methods while keeping error low even when the car is far away. A sympathetic reader would see the work as turning noisy stereo data into a practical, low-cost alert signal through tracking plus filtering rather than single-frame depth.

Core claim

SHLE achieves an average error below 10 cm even when the car is 70 m from the devices by first detecting and tracking the height limit objects in the left or right image, then temporally measuring, extracting, and filtering depth values to compute the limit; the method outperforms all compared baselines on the Disparity Height dataset and reaches state-of-the-art performance.

What carries the argument

The SHLE two-stage pipeline: devices detection and tracking followed by depth measurement, extraction, and filtering.

If this is right

  • Vehicles equipped with stereo cameras can generate real-time height alerts without expensive sensors.
  • Early detection at long range gives drivers time to adjust speed or route.
  • The same tracking-plus-filtering approach can be applied to other roadside objects whose clearance matters.
  • The released Disparity Height dataset provides a common test bed for future stereo height methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be combined with map data so that once a device is measured its height is stored for later trips.
  • If depth filtering proves robust in rain or at night, the same pipeline might extend to other low-light traffic safety tasks.
  • Integration with vehicle CAN bus data could allow automatic speed reduction when an over-height risk is confirmed.

Load-bearing premise

The depth filtering stage can reliably isolate and stabilize measurements to the tracked device across frames despite stereo matching noise, occlusions, or scene motion.
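
One way to picture the temporal filtering this premise depends on is a scalar Kalman filter run over per-frame height estimates of a single tracked device. This is a minimal sketch under stated assumptions, not the authors' filter: the summary does not say which filter SHLE uses (the reference list only includes Kalman's paper [63]), and the name `filter_static_height` and the variances `meas_var` and `process_var` are hypothetical.

```python
from typing import Iterable, List


def filter_static_height(heights_m: Iterable[float],
                         meas_var: float = 0.25,
                         process_var: float = 1e-4) -> List[float]:
    """Scalar Kalman filter for a height assumed (nearly) constant over time.

    meas_var and process_var are illustrative values, not taken from the paper.
    """
    estimates: List[float] = []
    x = None   # current height estimate in metres
    p = 0.0    # variance of the current estimate
    for z in heights_m:
        if x is None:
            # Initialise from the first measurement.
            x, p = z, meas_var
        else:
            p += process_var           # predict: the height barely changes
            k = p / (p + meas_var)     # Kalman gain
            x += k * (z - x)           # update with the new measurement
            p *= (1.0 - k)
        estimates.append(x)
    return estimates


noisy = [4.6, 4.4, 4.55, 4.45, 4.5]
smoothed = filter_static_height(noisy)
```

Each new frame pulls the estimate toward the measurement by a shrinking gain, so isolated outliers from stereo matching noise have less influence as more frames accumulate.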

What would settle it

Run the pipeline on stereo video sequences of known-height devices at 70 m distance and measure whether the average absolute error exceeds 10 cm.

Figures

Figures reproduced from arXiv: 2212.11538 by Hongyan Liu, Jun He, Kaixing Yang, Min Zhang, Zhaoxin Fan, Zhenbo Song.

Figure 1. Over-height vehicle strike.
Figure 2. Display effect for cameras estimation results. To this end, we can get accurate height limit estimation. Note we are the first work of proposing vision based methods for height limit estimation for modern cars. Therefore, there is no public available dataset that we can use. To benchmark our task, we propose a novel large-scale dataset named "Disparity Height". "Disparity Height" is collected in natural o…
Figure 3. SHLE. For each frame f_i, SHLE takes disparity map D_i and RGB image I as input and outputs the height h of the corresponding height limit device. For each scene, SHLE will generate a scene-level height after collecting all frames' output. In stage 1, for each frame, we first execute object detection by the Height Limit Device Detector F(∗) to get a bounding box b of RGB image I, secondly apply object tracking by …
Figure 4. Compare between Predict and Ground Truth
Figure 5. Object Tracking. For example, our data set is a collection of image sequences taken in different scenes. Suppose a scene contains a total of M frames of image sequences with valid height limit devices, but the images with valid prediction boxes detected by object detection method may be less than M frames, as shown in …
Figure 6. Frustum. $[x_w, y_w, z_w]^\top = R^{-1}\left([x_c, y_c, z_c]^\top - T\right)$ (4). Finally, we add the mounting height to points' y-axis. However, $y_w$ at point $A \in p_w$ does not yet represent the real world height. $y_w$ only represents the relative height of A and stereo camera. At this point, it is also necessary to know the mounting height of stereo camera $H_m$, and $H_m$ is 1.45 m in this paper. So y-axis value $y'_w$ of the point A …
Figure 7. Pixel Extension. Firstly, we conduct pixel extension. We extend the lower boundary of the predicted box, then execute frustum-based target extractor for the extended points, finally back-project them into a point cloud. For height limit estimation task, it is important to obtain the lower edge line of the device if the height in 3D space can be accurately calculated. Existing object detection methods, even …
Figure 8. Incorrect Contain. Red bounding box is our labeled …
Figure 9. Kernel Density Estimation. Third, we conduct kernel density estimation [62]. We first believe that the probability distribution of depth of the points in object detection boundary box should be similar to the normal distribution, that is, bell-shaped, low at the ends and high in the middle. Thus, we first treat its distribution as normal distribution with big standard deviation. Then, interval center point …
Figure 12. Data Annotation. (Section V, Experiment; A. Dataset and Metric.) Since we are the first to utilize vision based methods for height limit estimation task, there is no public available dataset that we can use. To benchmark our task, we propose a novel large-scale dataset named "Disparity Height". For shooting setting, the baseline of our stereo camera is 120 mm, the camera mounting height is 1.45 m, the resol…
Figure 13. Visualisation effect of SHLE. Our trained model is executed on demo data rather than training or validation data, …
Figure 14. Hyper-parameter. Fig. 14 shows the specific hyper-parameter process, taking …
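
The Figure 6 caption above maps camera coordinates to world coordinates with Eq. (4) and then adds the camera mounting height. The sketch below walks one pixel through that chain. It assumes a rectified pinhole stereo model with depth equal to focal length times baseline divided by disparity; the function name `pixel_to_world_height`, the focal length of 1000 px, the principal point, and the "+y points up" convention are all hypothetical. Only the 120 mm baseline and the 1.45 m mounting height come from the captions.

```python
import numpy as np


def pixel_to_world_height(u, v, disparity_px, focal_px, cx, cy,
                          baseline_m, R, T, mount_height_m=1.45):
    """Back-project one pixel and return its height above the road in metres."""
    # Stereo triangulation: depth along the optical axis.
    z_c = focal_px * baseline_m / disparity_px
    # Pinhole back-projection; the sign flip on v makes +y point up
    # (an assumed convention, not confirmed by the paper excerpt).
    x_c = (u - cx) * z_c / focal_px
    y_c = (cy - v) * z_c / focal_px
    p_c = np.array([x_c, y_c, z_c], dtype=float)
    # Eq. (4): p_w = R^{-1} (p_c - T), solved without forming the inverse.
    p_w = np.linalg.solve(np.asarray(R, dtype=float),
                          p_c - np.asarray(T, dtype=float))
    # y_w is only relative to the camera, so add the mounting height H_m.
    return float(p_w[1] + mount_height_m)


# Illustrative call: identity extrinsics, a pixel 50 rows above the principal point.
h = pixel_to_world_height(u=640, v=310, disparity_px=2.0, focal_px=1000.0,
                          cx=640.0, cy=360.0, baseline_m=0.12,
                          R=np.eye(3), T=np.zeros(3))
```

With these made-up intrinsics, a disparity of 2 px corresponds to a depth of 60 m, which shows how few pixels of disparity carry the measurement at long range.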
Original abstract

Recently, over-height vehicle strike frequently occurs, causing great economic cost and serious safety problems. Hence, an alert system which can accurately discover any possible height limiting devices in advance is necessary to be employed in modern large or medium sized cars, such as touring cars. Detecting and estimating the height limiting devices act as the key point of a successful height limit alert system. Though there are some works research height limit estimation, existing methods are either too computational expensive or not accurate enough. In this paper, we propose a novel stereo-based pipeline named SHLE for height limit estimation. Our SHLE pipeline consists of two stages. In stage 1, a novel devices detection and tracking scheme is introduced, which accurately locate the height limit devices in the left or right image. Then, in stage 2, the depth is temporally measured, extracted and filtered to calculate the height limit device. To benchmark the height limit estimation task, we build a large-scale dataset named "Disparity Height", where stereo images, pre-computed disparities and ground-truth height limit annotations are provided. We conducted extensive experiments on "Disparity Height" and the results show that SHLE achieves an average error below than 10cm though the car is 70m away from the devices. Our method also outperforms all compared baselines and achieves state-of-the-art performance. Code is available at https://github.com/Yang-Kaixing/SHLE.
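
The "extracted" step in the abstract can be illustrated with the kernel density estimate mentioned in the Figure 9 caption: among all depths sampled inside a detection box, pick the depth where the density peaks, on the assumption that the device contributes the dominant cluster. This is a generic Gaussian KDE written for illustration, not the procedure from the paper, whose details are truncated in the caption. The name `kde_mode_depth`, the bandwidth, and the grid size are hypothetical.

```python
import numpy as np


def kde_mode_depth(depths_m, bandwidth_m=0.5, grid_size=512):
    """Return the depth (metres) at which a Gaussian KDE of the samples peaks."""
    depths = np.asarray(depths_m, dtype=float)
    # Evaluate the density on a regular grid spanning the observed depths.
    grid = np.linspace(depths.min(), depths.max(), grid_size)
    # One Gaussian kernel per sample; the normalising constant is omitted
    # because it does not change the location of the maximum.
    scaled = (grid[:, None] - depths[None, :]) / bandwidth_m
    density = np.exp(-0.5 * scaled ** 2).sum(axis=1)
    return float(grid[np.argmax(density)])


# Illustrative data: a tight cluster near 60 m (device) plus sparse background.
device = np.linspace(59.5, 60.5, 50)
background = np.linspace(90.0, 100.0, 10)
mode = kde_mode_depth(np.concatenate([device, background]))
```

Taking the mode rather than the mean keeps background pixels inside the box from dragging the depth estimate away from the device.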

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes SHLE, a two-stage stereo pipeline for height-limit estimation in vehicles. Stage 1 detects and tracks height-limit devices in left/right images; stage 2 performs temporal depth measurement, extraction, and filtering to compute device heights. A new 'Disparity Height' dataset is introduced containing stereo images, pre-computed disparities, and ground-truth height annotations. Experiments on this dataset are reported to show average height error below 10 cm at distances up to 70 m, with SHLE outperforming all baselines and achieving state-of-the-art performance. Code is released at a public GitHub repository.

Significance. If the temporal filtering stage can demonstrably suppress stereo-matching noise, occlusions, and scene motion to the precision needed for sub-10 cm height error at 70 m, the work would offer a practical, deployable component for automotive height-limit alert systems. Public release of code and a new benchmark dataset are concrete strengths that support reproducibility.

major comments (3)
  1. [Abstract] Abstract: the headline claim of average error below 10 cm at 70 m is load-bearing for the contribution, yet the manuscript provides neither distance-binned error statistics nor quantitative evidence that the stage-2 filtering reduces effective disparity error sufficiently to overcome the quadratic growth of stereo depth uncertainty (standard propagation δd ≈ (d²/(f·b))·δdisp).
  2. [Stage 2] Stage 2 (depth filtering): the description of 'temporally measured, extracted and filtered' depth lacks the concrete algorithm (median, Kalman, or other), ablation studies, or variance-reduction measurements needed to evaluate whether it can isolate device measurements under the dataset's noise, occlusion, and motion conditions.
  3. [Experiments] Experiments section: no table or figure reports error versus distance, dataset distance distribution, or long-range sample counts; without these the SOTA claim and the 70 m result cannot be verified against the known quadratic error scaling.
minor comments (3)
  1. [Abstract] Abstract contains the ungrammatical phrase 'below than 10cm'; correct to 'below 10 cm'.
  2. Dataset statistics (number of sequences, distance histogram, number of devices at >50 m) are not reported, hindering assessment of the benchmark's difficulty and coverage.
  3. Baseline implementations and training details are referenced only generically; explicit citations or configuration tables would improve reproducibility.
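
To make the scaling argument in major comment 1 concrete, the sketch below evaluates the quoted propagation formula δd ≈ (d²/(f·b))·δdisp at 70 m. The 120 mm baseline is the value given in the Figure 12 caption; the focal length of 2000 px and the disparity error of 0.25 px are hypothetical, since neither appears on this page. The second function assumes a pinhole model with the pixel row held fixed, so a point's height relative to the camera scales in proportion to its depth.

```python
def depth_uncertainty_m(depth_m, focal_px, baseline_m, disparity_err_px):
    """First-order depth error: delta_d = d^2 / (f * b) * delta_disp."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_err_px


def height_uncertainty_m(rel_height_m, depth_m, focal_px, baseline_m,
                         disparity_err_px):
    """Height error induced by the depth error, with the pixel row held fixed.

    Under a pinhole model, height relative to the camera is proportional to
    depth, so delta_y = (y / d) * delta_d.
    """
    delta_d = depth_uncertainty_m(depth_m, focal_px, baseline_m, disparity_err_px)
    return rel_height_m / depth_m * delta_d


# Assumed values: f = 2000 px and 0.25 px disparity error are illustrative only.
delta_d = depth_uncertainty_m(70.0, 2000.0, 0.12, 0.25)
delta_y = height_uncertainty_m(3.0, 70.0, 2000.0, 0.12, 0.25)
```

Under these assumed values the depth uncertainty at 70 m is about 5.1 m, and a point 3 m above the camera axis carries a height uncertainty of about 0.22 m, which is why the referee asks for evidence that filtering reduces the effective disparity error.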

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve verifiability of the results.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of average error below 10 cm at 70 m is load-bearing for the contribution, yet the manuscript provides neither distance-binned error statistics nor quantitative evidence that the stage-2 filtering reduces effective disparity error sufficiently to overcome the quadratic growth of stereo depth uncertainty (standard propagation δd ≈ (d²/(f·b))·δdisp).

    Authors: We agree that distance-binned statistics and explicit evidence on filtering efficacy would strengthen the claim. In revision we will add a table/figure with mean absolute error binned by distance (including per-bin sample counts) and a quantitative comparison of disparity variance before versus after temporal filtering to demonstrate reduction relative to the quadratic uncertainty scaling. revision: yes

  2. Referee: [Stage 2] Stage 2 (depth filtering): the description of 'temporally measured, extracted and filtered' depth lacks the concrete algorithm (median, Kalman, or other), ablation studies, or variance-reduction measurements needed to evaluate whether it can isolate device measurements under the dataset's noise, occlusion, and motion conditions.

    Authors: We will expand the Stage 2 section to name the exact filtering algorithm and its parameters, add ablation results (with/without filtering), and report measured variance reduction on the disparity values under the dataset conditions. revision: yes

  3. Referee: [Experiments] Experiments section: no table or figure reports error versus distance, dataset distance distribution, or long-range sample counts; without these the SOTA claim and the 70 m result cannot be verified against the known quadratic error scaling.

    Authors: We will add to the Experiments section a plot or table of error versus distance, the distance histogram of the Disparity Height dataset, and explicit counts of samples at long ranges (including near 70 m) to support verification. revision: yes
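
The distance-binned table promised in responses 1 and 3 could be produced along the following lines. This is a sketch with made-up numbers, not results from the paper; the function name `binned_mae` and the bin edges are hypothetical.

```python
import numpy as np


def binned_mae(distances_m, pred_heights_m, true_heights_m, bin_edges_m):
    """Mean absolute height error and sample count per distance bin.

    Bins are left-inclusive: [edge_i, edge_{i+1}). Samples outside the
    edges are ignored. Returns a list of (lo, hi, count, mae) tuples.
    """
    distances = np.asarray(distances_m, dtype=float)
    abs_err = np.abs(np.asarray(pred_heights_m, dtype=float)
                     - np.asarray(true_heights_m, dtype=float))
    # np.digitize returns 1-based bin indices for increasing edges.
    bin_idx = np.digitize(distances, bin_edges_m) - 1
    rows = []
    for i in range(len(bin_edges_m) - 1):
        mask = bin_idx == i
        count = int(mask.sum())
        mae = float(abs_err[mask].mean()) if count else float("nan")
        rows.append((bin_edges_m[i], bin_edges_m[i + 1], count, mae))
    return rows


# Made-up example data, for illustration only.
rows = binned_mae(distances_m=[10, 25, 30, 65, 69],
                  pred_heights_m=[4.5, 4.6, 4.4, 4.7, 4.3],
                  true_heights_m=[4.5, 4.5, 4.5, 4.5, 4.5],
                  bin_edges_m=[0, 20, 40, 60, 80])
```

Reporting the count next to each bin's error makes it visible when a long-range bin rests on only a handful of samples.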

Circularity Check

0 steps flagged

No circularity; empirical pipeline evaluated on held-out dataset

Full rationale

The SHLE pipeline is a two-stage stereo vision method (device detection/tracking then temporal depth extraction/filtering) whose performance claims are supported solely by empirical results on the independently annotated 'Disparity Height' dataset. No equations, derivations, or 'predictions' are presented that reduce by construction to fitted inputs, self-citations, or ansatzes; the method contains no load-bearing uniqueness theorems or self-referential definitions. The evaluation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the method is presented as an empirical pipeline.

pith-pipeline@v0.9.0 · 5796 in / 994 out tokens · 24906 ms · 2026-05-24T10:30:59.411381+00:00 · methodology


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 2 internal anchors

  1. [1]

    ”Stereo vision-Facing the challenges and seeing the opportunities for ADAS applications.” Texas Instruments Technical Note (2016)

    Dubey, Aish. ”Stereo vision-Facing the challenges and seeing the opportunities for ADAS applications.” Texas Instruments Technical Note (2016)

  2. [2]

    ”Distance measurement system for autonomous vehicles using stereo camera.” Array 5 (2020): 100016

    Zaarane, Abdelmoghit, et al. ”Distance measurement system for autonomous vehicles using stereo camera.” Array 5 (2020): 100016

  3. [4]

    ”Spatial pyramid pooling in deep convolutional networks for visual recognition.” IEEE transactions on pattern analysis and machine intelligence 37.9 (2015): 1904-1916

    He, Kaiming, et al. ”Spatial pyramid pooling in deep convolutional networks for visual recognition.” IEEE transactions on pattern analysis and machine intelligence 37.9 (2015): 1904-1916

  4. [5]

    Vision-based over-height vehicle detection for warning drivers

    Nguyen, Bella. Vision-based over-height vehicle detection for warning drivers. Diss. University of Cambridge, 2018

  5. [6]

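    ”Design and implementation of a vehicle-mounted height-limit obstacle detection system.” 电光系统 2 (2018): 13-17

    Liu Meng (刘梦). ”Design and implementation of a vehicle-mounted height-limit obstacle detection system.” 电光系统 2 (2018): 13-17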

  6. [7]

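    ”Research on a road parameter calculation method for LiDAR-assisted driving.” 应用光学 41.1 (2020): 209

    You Anqing (游安清), et al. ”Research on a road parameter calculation method for LiDAR-assisted driving.” 应用光学 41.1 (2020): 209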

  7. [8]

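    Research on vehicle-mounted measurement of road restriction geometry and over-height warning methods

    Zhang Kuo (张阔). Research on vehicle-mounted measurement of road restriction geometry and over-height warning methods. MS thesis. Yanshan University, 2014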

  8. [9]

    ”Detection of individual trees and estimation of tree height using LiDAR data.” Journal of Forest Research 12.6 (2007): 425-434

    Kwak, Doo-Ahn, et al. ”Detection of individual trees and estimation of tree height using LiDAR data.” Journal of Forest Research 12.6 (2007): 425-434

  9. [10]

    Rosette, J. A. B., P. R. J. North, and J. C. Suarez. ”Vegetation height estimates for a mixed temperate forest using satellite laser altimetry.” International journal of remote sensing 29.5 (2008): 1475-1493

  10. [11]

    ”Crop height monitoring with digital imagery from Unmanned Aerial System (UAS).” Computers and Electronics in Agriculture 141 (2017): 232-237

    Chang, Anjin, et al. ”Crop height monitoring with digital imagery from Unmanned Aerial System (UAS).” Computers and Electronics in Agriculture 141 (2017): 232-237

  11. [12]

    ”Biomass and crop height estimation of different crops using UAV-based LiDAR.” Remote Sensing 12.1 (2019): 17

    ten Harkel, Jelle, Harm Bartholomeus, and Lammert Kooistra. ”Biomass and crop height estimation of different crops using UAV-based LiDAR.” Remote Sensing 12.1 (2019): 17

  12. [13]

    ”Wheat height estimation using LiDAR in comparison to ultrasonic sensor and UAS.” Sensors 18.11 (2018): 3731

    Yuan, Wenan, et al. ”Wheat height estimation using LiDAR in comparison to ultrasonic sensor and UAS.” Sensors 18.11 (2018): 3731

  13. [14]

    ”Regression kriging for improving crop height models fusing ultra-sonic sensing with UAV imagery.” Remote Sensing 9.7 (2017): 665

    Schirrmann, Michael, et al. ”Regression kriging for improving crop height models fusing ultra-sonic sensing with UAV imagery.” Remote Sensing 9.7 (2017): 665

  14. [15]

    ”Global canopy height regression and uncertainty estimation from GEDI LIDAR waveforms with deep ensembles.” Remote Sensing of Environment 268 (2022): 112760

    Lang, Nico, et al. ”Global canopy height regression and uncertainty estimation from GEDI LIDAR waveforms with deep ensembles.” Remote Sensing of Environment 268 (2022): 112760

  15. [16]

    ”Faster r-cnn: Towards real-time object detection with region proposal networks.” Advances in neural information processing systems 28 (2015)

    Ren, Shaoqing, et al. ”Faster r-cnn: Towards real-time object detection with region proposal networks.” Advances in neural information processing systems 28 (2015)

  16. [17]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Simonyan, Karen, and Andrew Zisserman. ”Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014)

  17. [18]

    ”Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition

    He, Kaiming, et al. ”Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016

  18. [19]

    ”Mobilenetv2: Inverted residuals and linear bottlenecks.” Proceedings of the IEEE conference on computer vision and pattern recognition

    Sandler, Mark, et al. ”Mobilenetv2: Inverted residuals and linear bottlenecks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018

  19. [20]

    ”Searching for mobilenetv3.” Proceedings of the IEEE/CVF international conference on computer vision

    Howard, Andrew, et al. ”Searching for mobilenetv3.” Proceedings of the IEEE/CVF international conference on computer vision. 2019

  20. [21]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, Alexey, et al. ”An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020)

  21. [22]

    ”Swin transformer: Hierarchical vision transformer using shifted windows.” Proceedings of the IEEE/CVF International Conference on Computer Vision

    Liu, Ze, et al. ”Swin transformer: Hierarchical vision transformer using shifted windows.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021

  22. [23]

    ”Vivit: A video vision transformer.” Proceedings of the IEEE/CVF International Conference on Computer Vision

    Arnab, Anurag, et al. ”Vivit: A video vision transformer.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021

  23. [24]

    ”Fully convolutional networks for semantic segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition

    Long, Jonathan, Evan Shelhamer, and Trevor Darrell. ”Fully convolutional networks for semantic segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015

  24. [25]

    ”U-net: Convolutional networks for biomedical image segmentation.” International Conference on Medical image computing and computer-assisted intervention

    Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. ”U-net: Convolutional networks for biomedical image segmentation.” International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015

  25. [26]

    ”Unet++: A nested u-net architecture for medical image segmentation.” Deep learning in medical image analysis and multimodal learning for clinical decision support

    Zhou, Zongwei, et al. ”Unet++: A nested u-net architecture for medical image segmentation.” Deep learning in medical image analysis and multimodal learning for clinical decision support. Springer, Cham, 2018. 3-11

  26. [27]

    ”H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes.” IEEE transactions on medical imaging 37.12 (2018): 2663-2674

    Li, Xiaomeng, et al. ”H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes.” IEEE transactions on medical imaging 37.12 (2018): 2663-2674

  27. [28]

    ”Robust object tracking with online multiple instance learning.” IEEE transactions on pattern analysis and machine intelligence 33.8 (2010): 1619-1632

    Babenko, Boris, Ming-Hsuan Yang, and Serge Belongie. ”Robust object tracking with online multiple instance learning.” IEEE transactions on pattern analysis and machine intelligence 33.8 (2010): 1619-1632

  28. [29]

    ”High-speed tracking with kernelized correlation filters.” IEEE transactions on pattern analysis and machine intelligence 37.3 (2014): 583-596

    Henriques, João F., et al. ”High-speed tracking with kernelized correlation filters.” IEEE transactions on pattern analysis and machine intelligence 37.3 (2014): 583-596

  29. [30]

    ”Discriminative correlation filter with channel and spatial reliability.” Proceedings of the IEEE conference on computer vision and pattern recognition

    Lukezic, Alan, et al. ”Discriminative correlation filter with channel and spatial reliability.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017

  30. [31]

    ”Visual object tracking using adaptive correlation filters.” 2010 IEEE computer society conference on computer vision and pattern recognition

    Bolme, David S., et al. ”Visual object tracking using adaptive correlation filters.” 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 2010

  31. [32]

    ”Real-time tracking via on-line boosting.” BMVC

    Grabner, Helmut, Michael Grabner, and Horst Bischof. ”Real-time tracking via on-line boosting.” BMVC. Vol. 1. No. 5. 2006. 13

  32. [33]

    ”Forward-backward error: Automatic detection of tracking failures.” 2010 20th international conference on pattern recognition

    Kalal, Zdenek, Krystian Mikolajczyk, and Jiri Matas. ”Forward-backward error: Automatic detection of tracking failures.” 2010 20th international conference on pattern recognition. IEEE, 2010

  33. [34]

    ”Depth map prediction from a single image using a multi-scale deep network.” Advances in neural information processing systems 27 (2014)

    Eigen, David, Christian Puhrsch, and Rob Fergus. ”Depth map prediction from a single image using a multi-scale deep network.” Advances in neural information processing systems 27 (2014)

  34. [35]

    ”Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture.” Proceedings of the IEEE international conference on computer vision

    Eigen, David, and Rob Fergus. ”Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture.” Proceedings of the IEEE international conference on computer vision. 2015

  35. [36]

    ”Deep ordinal regression network for monocular depth estimation.” Proceedings of the IEEE conference on computer vision and pattern recognition

    Fu, Huan, et al. ”Deep ordinal regression network for monocular depth estimation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018

  36. [37]

    ”Adabins: Depth estimation using adaptive bins.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Bhat, Shariq Farooq, Ibraheem Alhashim, and Peter Wonka. ”Adabins: Depth estimation using adaptive bins.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021

  37. [38]

    Godard, Clément, Oisin Mac Aodha, and Gabriel J. Brostow. ”Unsupervised monocular depth estimation with left-right consistency.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017

  38. [39]

    ”Unsupervised monocular depth estimation using attention and multi-warp reconstruction.” IEEE Transactions on Multimedia (2021)

    Ling, Chuanwu, Xiaogang Zhang, and Hua Chen. ”Unsupervised monocular depth estimation using attention and multi-warp reconstruction.” IEEE Transactions on Multimedia (2021)

  39. [40]

    ”Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras.” Proceedings of the IEEE/CVF International Conference on Computer Vision

    Gordon, Ariel, et al. ”Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019

  40. [41]

    ”Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera.” 2019 International Conference on Robotics and Automation (ICRA)

    Ma, Fangchang, Guilherme Venturelli Cavalheiro, and Sertac Karaman. ”Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera.” 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019

  41. [42]

    ”Self-Supervised Depth Completion From Direct Visual-LiDAR Odometry in Autonomous Driving.” IEEE Transactions on Intelligent Transportation Systems (2021)

    Song, Zhenbo, et al. ”Self-Supervised Depth Completion From Direct Visual-LiDAR Odometry in Autonomous Driving.” IEEE Transactions on Intelligent Transportation Systems (2021)

  42. [43]

    ”Selfdeco: Self-supervised monocular depth completion in challenging indoor environments.” 2021 IEEE International Conference on Robotics and Automation (ICRA)

    Choi, Jaehoon, et al. ”Selfdeco: Self-supervised monocular depth completion in challenging indoor environments.” 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021

  43. [44]

    ”Learning rich features from RGB-D images for object detection and segmentation.” European conference on computer vision

    Gupta, Saurabh, et al. ”Learning rich features from RGB-D images for object detection and segmentation.” European conference on computer vision. Springer, Cham, 2014

  44. [45]

    ”Multimodal deep learning for robust RGB-D object recognition.” 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

    Eitel, Andreas, et al. ”Multimodal deep learning for robust RGB-D object recognition.” 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015

  45. [46]

    ”CANet: Co-attention network for RGB-D semantic segmentation.” Pattern Recognition 124 (2022): 108468

    Zhou, Hao, et al. ”CANet: Co-attention network for RGB-D semantic segmentation.” Pattern Recognition 124 (2022): 108468

  46. [47]

    ”Intrinsic scene properties from a single rgb-d image.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    Barron, Jonathan T., and Jitendra Malik. ”Intrinsic scene properties from a single rgb-d image.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013

  47. [48]

    ”Single image depth estimation from predicted semantic labels.” 2010 IEEE computer society conference on computer vision and pattern recognition

    Liu, Beyang, Stephen Gould, and Daphne Koller. ”Single image depth estimation from predicted semantic labels.” 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 2010

  48. [49]

    ”Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance.” European Conference on Computer Vision

    Klingner, Marvin, et al. ”Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance.” European Conference on Computer Vision. Springer, Cham, 2020

  49. [50]

    ”Robust object proposals re-ranking for object detection in autonomous driving using convolutional neural networks.” Signal Processing: Image Communication 53 (2017): 110-122

    Pham, Cuong Cao, and Jae Wook Jeon. ”Robust object proposals re-ranking for object detection in autonomous driving using convolutional neural networks.” Signal Processing: Image Communication 53 (2017): 110-122

  50. [51]

    ”Data-driven 3d voxel patterns for object category recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition

    Xiang, Yu, et al. ”Data-driven 3d voxel patterns for object category recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015

  51. [52]

    ”Pointnet: Deep learning on point sets for 3d classification and segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition

    Qi, Charles R., et al. ”Pointnet: Deep learning on point sets for 3d classification and segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017

  52. [53]

    ”Pointnet++: Deep hierarchical feature learning on point sets in a metric space.” Advances in neural information processing systems 30 (2017)

    Qi, Charles Ruizhongtai, et al. ”Pointnet++: Deep hierarchical feature learning on point sets in a metric space.” Advances in neural information processing systems 30 (2017)

  53. [54]

    ”Frustum pointnets for 3d object detection from rgb-d data.” Proceedings of the IEEE conference on computer vision and pattern recognition

    Qi, Charles R., et al. ”Frustum pointnets for 3d object detection from rgb-d data.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018

  54. [55]

    ”Pointnetlk: Robust & efficient point cloud registration using pointnet.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Aoki, Yasuhiro, et al. ”Pointnetlk: Robust & efficient point cloud registration using pointnet.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019

  55. [56]

    ”CoBEVT: Cooperative bird’s eye view semantic segmentation with sparse transformers.” arXiv preprint arXiv:2207.02202 (2022)

    Xu, Runsheng, et al. ”CoBEVT: Cooperative bird’s eye view semantic segmentation with sparse transformers.” arXiv preprint arXiv:2207.02202 (2022)

  56. [57]

    ”V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer.” arXiv preprint arXiv:2203.10638 (2022)

    Xu, Runsheng, et al. ”V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer.” arXiv preprint arXiv:2203.10638 (2022)

  57. [58]

    Zhu, Xingkui, et al. ”TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021

  58. [59]

    ”Centernet: Keypoint triplets for object detection.” Proceedings of the IEEE/CVF international conference on computer vision

    Duan, Kaiwen, et al. ”Centernet: Keypoint triplets for object detection.” Proceedings of the IEEE/CVF international conference on computer vision. 2019

  59. [60]

    ”Focal loss for dense object detection.” Proceedings of the IEEE international conference on computer vision

    Lin, Tsung-Yi, et al. ”Focal loss for dense object detection.” Proceedings of the IEEE international conference on computer vision. 2017

  60. [61]

    ”Fcos: Fully convolutional one-stage object detection.” Proceedings of the IEEE/CVF international conference on computer vision

    Tian, Zhi, et al. ”Fcos: Fully convolutional one-stage object detection.” Proceedings of the IEEE/CVF international conference on computer vision. 2019

  61. [62]

    ”On estimation of a probability density function and mode.” The annals of mathematical statistics 33.3 (1962): 1065-1076

    Parzen, Emanuel. ”On estimation of a probability density function and mode.” The annals of mathematical statistics 33.3 (1962): 1065-1076

  62. [63]

    ”A new approach to linear filtering and prediction problems.” (1960): 35-45

    Kalman, Rudolph Emil. ”A new approach to linear filtering and prediction problems.” (1960): 35-45