SHLE: Devices Tracking and Depth Filtering for Stereo-based Height Limit Estimation
Pith reviewed 2026-05-24 10:30 UTC · model grok-4.3
The pith
A stereo pipeline tracks height-limit devices, then filters depth measurements over time to estimate their clearance, with a reported average error under 10 cm at 70 m range.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SHLE achieves an average error below 10 cm even when the car is 70 m from the devices by first detecting and tracking the height limit objects in the left or right image, then temporally measuring, extracting, and filtering depth values to compute the limit; the method outperforms all compared baselines on the Disparity Height dataset and reaches state-of-the-art performance.
What carries the argument
The two-stage SHLE pipeline: device detection and tracking, followed by depth measurement, extraction, and filtering.
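The geometry that a pipeline of this kind relies on can be sketched with the standard rectified pinhole stereo model. This is a minimal illustration and not the authors' implementation: the function name `clearance_from_disparity` and every numeric value (focal length, baseline, principal point, camera mounting height) are hypothetical, and a level camera over a flat road is assumed.

```python
def clearance_from_disparity(disp_px, v_px, f_px, baseline_m, cy_px, cam_height_m):
    """Estimate depth and clearance (metres above the road) of one image point.

    Assumes a rectified stereo pair, a level camera, and a flat road.
    All parameter values used in the example below are hypothetical.
    """
    if disp_px <= 0:
        raise ValueError("disparity must be positive")
    depth_m = f_px * baseline_m / disp_px        # Z = f * b / d
    # Image rows grow downward, so points above the optical axis have v < cy.
    up_m = (cy_px - v_px) * depth_m / f_px       # height relative to the camera
    return depth_m, cam_height_m + up_m


# Hypothetical rig: 2000 px focal length, 0.28 m baseline, camera 1.5 m above the road.
depth, clearance = clearance_from_disparity(
    disp_px=8.0, v_px=500.0, f_px=2000.0,
    baseline_m=0.28, cy_px=600.0, cam_height_m=1.5,
)
print(f"depth ~ {depth:.1f} m, clearance ~ {clearance:.2f} m")
```

With these made-up numbers the point sits at roughly 70 m, which is the range the headline claim refers to; a disparity of only a few pixels at that distance is why the filtering stage matters.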
If this is right
- Vehicles equipped with stereo cameras can generate real-time height alerts without expensive sensors.
- Early detection at long range gives drivers time to adjust speed or route.
- The same tracking-plus-filtering approach can be applied to other roadside objects whose clearance matters.
- The released Disparity Height dataset provides a common test bed for future stereo height methods.
Where Pith is reading between the lines
- The method could be combined with map data so that once a device is measured its height is stored for later trips.
- If depth filtering proves robust in rain or at night, the same pipeline might extend to other low-light traffic safety tasks.
- Integration with vehicle CAN bus data could allow automatic speed reduction when an over-height risk is confirmed.
Load-bearing premise
The depth filtering stage can reliably isolate and stabilize measurements to the tracked device across frames despite stereo matching noise, occlusions, or scene motion.
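What "stabilizing measurements across frames" could look like is sketched below. The source does not name the filter SHLE uses, so a sliding median over per-frame height estimates is swapped in purely as a stand-in; the class name `SlidingMedian`, the window size, and the sample values are assumptions made for illustration.

```python
from collections import deque
from statistics import median


class SlidingMedian:
    """Sliding-window median over per-frame estimates (a stand-in filter)."""

    def __init__(self, window=9):
        if window < 1:
            raise ValueError("window must be at least 1")
        self._values = deque(maxlen=window)  # oldest estimate drops out automatically

    def update(self, value):
        """Add one estimate (None marks a missed frame) and return the current median."""
        if value is not None:
            self._values.append(value)
        return median(self._values) if self._values else None


# Synthetic per-frame height estimates in metres: one dropout and one outlier.
noisy = [4.52, 4.47, None, 6.10, 4.50, 4.55, 4.49]
filt = SlidingMedian(window=5)
smoothed = [filt.update(h) for h in noisy]
print(smoothed)
```

A median is used here because a single bad stereo match (the 6.10 m sample) does not drag the output with it, whereas a plain mean would. Whether this matches the behavior of the paper's stage 2 is exactly what the premise leaves open.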
What would settle it
Run the pipeline on stereo video sequences of known-height devices at 70 m distance and measure whether the average absolute error exceeds 10 cm.
Original abstract
Recently, over-height vehicle strike frequently occurs, causing great economic cost and serious safety problems. Hence, an alert system which can accurately discover any possible height limiting devices in advance is necessary to be employed in modern large or medium sized cars, such as touring cars. Detecting and estimating the height limiting devices act as the key point of a successful height limit alert system. Though there are some works research height limit estimation, existing methods are either too computational expensive or not accurate enough. In this paper, we propose a novel stereo-based pipeline named SHLE for height limit estimation. Our SHLE pipeline consists of two stages. In stage 1, a novel devices detection and tracking scheme is introduced, which accurately locate the height limit devices in the left or right image. Then, in stage 2, the depth is temporally measured, extracted and filtered to calculate the height limit device. To benchmark the height limit estimation task, we build a large-scale dataset named "Disparity Height", where stereo images, pre-computed disparities and ground-truth height limit annotations are provided. We conducted extensive experiments on "Disparity Height" and the results show that SHLE achieves an average error below than 10cm though the car is 70m away from the devices. Our method also outperforms all compared baselines and achieves state-of-the-art performance. Code is available at https://github.com/Yang-Kaixing/SHLE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SHLE, a two-stage stereo pipeline for height-limit estimation in vehicles. Stage 1 detects and tracks height-limit devices in left/right images; stage 2 performs temporal depth measurement, extraction, and filtering to compute device heights. A new 'Disparity Height' dataset is introduced containing stereo images, pre-computed disparities, and ground-truth height annotations. Experiments on this dataset are reported to show average height error below 10 cm at distances up to 70 m, with SHLE outperforming all baselines and achieving state-of-the-art performance. Code is released at a public GitHub repository.
Significance. If the temporal filtering stage can demonstrably suppress stereo-matching noise, occlusions, and scene motion to the precision needed for sub-10 cm height error at 70 m, the work would offer a practical, deployable component for automotive height-limit alert systems. Public release of code and a new benchmark dataset are concrete strengths that support reproducibility.
major comments (3)
- [Abstract] Abstract: the headline claim of average error below 10 cm at 70 m is load-bearing for the contribution, yet the manuscript provides neither distance-binned error statistics nor quantitative evidence that the stage-2 filtering reduces effective disparity error sufficiently to overcome the quadratic growth of stereo depth uncertainty (standard propagation δd ≈ (d²/(f·b))·δdisp).
- [Stage 2] Stage 2 (depth filtering): the description of 'temporally measured, extracted and filtered' depth lacks the concrete algorithm (median, Kalman, or other), ablation studies, or variance-reduction measurements needed to evaluate whether it can isolate device measurements under the dataset's noise, occlusion, and motion conditions.
- [Experiments] Experiments section: no table or figure reports error versus distance, dataset distance distribution, or long-range sample counts; without these the SOTA claim and the 70 m result cannot be verified against the known quadratic error scaling.
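The quadratic scaling the referee invokes, δd ≈ (d²/(f·b))·δdisp, can be made concrete with a few lines of arithmetic. The sketch below only evaluates that propagation formula; the focal length, baseline, and disparity error are hypothetical values, not parameters reported for the Disparity Height rig.

```python
def depth_error_m(depth_m, f_px, baseline_m, disp_err_px):
    """First-order depth uncertainty: delta_d ~ d^2 / (f * b) * delta_disp."""
    return depth_m ** 2 / (f_px * baseline_m) * disp_err_px


# Hypothetical rig: f = 2000 px, b = 0.28 m, disparity error of 0.25 px.
F_PX, BASELINE_M, DISP_ERR_PX = 2000.0, 0.28, 0.25

for d in (10.0, 30.0, 50.0, 70.0):
    err = depth_error_m(d, F_PX, BASELINE_M, DISP_ERR_PX)
    print(f"depth {d:4.0f} m -> depth uncertainty ~ {err:.2f} m")
```

Under these assumed values the uncertainty at 70 m is 49 times that at 10 m, which is why error statistics binned by distance carry more information than a single average.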
minor comments (3)
- [Abstract] Abstract contains the ungrammatical phrase 'below than 10cm'; correct to 'below 10 cm'.
- Dataset statistics (number of sequences, distance histogram, number of devices at >50 m) are not reported, hindering assessment of the benchmark's difficulty and coverage.
- Baseline implementations and training details are referenced only generically; explicit citations or configuration tables would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve verifiability of the results.
- Referee: [Abstract] Abstract: the headline claim of average error below 10 cm at 70 m is load-bearing for the contribution, yet the manuscript provides neither distance-binned error statistics nor quantitative evidence that the stage-2 filtering reduces effective disparity error sufficiently to overcome the quadratic growth of stereo depth uncertainty (standard propagation δd ≈ (d²/(f·b))·δdisp).
  Authors: We agree that distance-binned statistics and explicit evidence on filtering efficacy would strengthen the claim. In revision we will add a table/figure with mean absolute error binned by distance (including per-bin sample counts) and a quantitative comparison of disparity variance before versus after temporal filtering to demonstrate reduction relative to the quadratic uncertainty scaling. Revision: yes.
- Referee: [Stage 2] Stage 2 (depth filtering): the description of 'temporally measured, extracted and filtered' depth lacks the concrete algorithm (median, Kalman, or other), ablation studies, or variance-reduction measurements needed to evaluate whether it can isolate device measurements under the dataset's noise, occlusion, and motion conditions.
  Authors: We will expand the Stage 2 section to name the exact filtering algorithm and its parameters, add ablation results (with/without filtering), and report measured variance reduction on the disparity values under the dataset conditions. Revision: yes.
- Referee: [Experiments] Experiments section: no table or figure reports error versus distance, dataset distance distribution, or long-range sample counts; without these the SOTA claim and the 70 m result cannot be verified against the known quadratic error scaling.
  Authors: We will add to the Experiments section a plot or table of error versus distance, the distance histogram of the Disparity Height dataset, and explicit counts of samples at long ranges (including near 70 m) to support verification. Revision: yes.
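The distance-binned table promised in these responses is straightforward to compute once per-sample predictions are available. The following is a minimal sketch under assumed inputs: the function name `binned_mae`, the tuple layout, the 10 m bin width, and the four sample rows are all illustrative and synthetic, not results from the paper.

```python
from collections import defaultdict


def binned_mae(samples, bin_width_m=10.0):
    """Mean absolute height error and sample count per distance bin.

    samples: iterable of (distance_m, predicted_height_m, true_height_m).
    Returns {bin_start_m: (mae_m, count)} ordered by distance.
    """
    sums = defaultdict(lambda: [0.0, 0])  # bin start -> [sum of |error|, count]
    for dist_m, pred_m, true_m in samples:
        bin_start = int(dist_m // bin_width_m) * bin_width_m
        sums[bin_start][0] += abs(pred_m - true_m)
        sums[bin_start][1] += 1
    return {start: (total / n, n) for start, (total, n) in sorted(sums.items())}


# Synthetic rows for illustration only: (distance, predicted height, true height).
synthetic = [
    (12.0, 4.52, 4.50),
    (18.0, 4.49, 4.50),
    (65.0, 4.58, 4.50),
    (69.0, 4.41, 4.50),
]
for start, (mae, n) in binned_mae(synthetic).items():
    print(f"{start:5.0f}-{start + 10:.0f} m: MAE {mae * 100:.1f} cm over {n} samples")
```

Reporting the count alongside each bin matters as much as the error itself: an average dominated by short-range samples could sit below 10 cm even if the few samples near 70 m do not.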
Circularity Check
No circularity; empirical pipeline evaluated on held-out dataset
The SHLE pipeline is a two-stage stereo vision method (device detection/tracking then temporal depth extraction/filtering) whose performance claims are supported solely by empirical results on the independently annotated 'Disparity Height' dataset. No equations, derivations, or 'predictions' are presented that reduce by construction to fitted inputs, self-citations, or ansatzes; the method contains no load-bearing uniqueness theorems or self-referential definitions. The evaluation is therefore self-contained against external benchmarks.