Recognition: 2 theorem links
Neuromorphic Monocular Depth Estimation with Uncertainty Modeling
Pith reviewed 2026-05-12 04:50 UTC · model grok-4.3
The pith
Integrating uncertainty estimation into neural networks allows event-based monocular depth prediction to flag reliable pixels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We predict per-pixel depth distributions from monocular event streams using U-Net models and estimate uncertainty with Gaussian, log-normal, and evidential learning frameworks. We compare six event representations and find that the representations perform similarly after synthetic pre-training and real fine-tuning, with 10-bin log-normal and 5-bin evidential models performing best across absolute relative error, root mean squared error, and area under the sparsification error (AUSE). Our experiments demonstrate that uncertainty estimation can be successfully integrated into event-based monocular depth estimation and used to indicate pixels with reliable depth.
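The claim summary does not spell out the training objectives. As one way to make the log-normal variant concrete, here is a minimal sketch of a per-pixel log-normal negative log-likelihood; the tensor names, clamping constants, and masking are illustrative assumptions, not the authors' implementation.

```python
import torch

def lognormal_nll(mu, log_var, depth_gt, valid_mask, eps=1e-6):
    """Per-pixel negative log-likelihood under a log-normal depth model.

    mu, log_var : network outputs of shape (B, 1, H, W); mu parameterises the
                  mean of log-depth and log_var its log-variance.
    depth_gt    : ground-truth metric depth, same shape.
    valid_mask  : boolean mask selecting pixels with valid ground truth.
    """
    log_d = torch.log(depth_gt.clamp(min=eps))        # move to log-depth space
    var = log_var.exp().clamp(min=eps)                # keep variance positive
    nll = 0.5 * (log_var + (log_d - mu) ** 2 / var)   # Gaussian NLL on log-depth
    return nll[valid_mask].mean()

# Predictive depth and a per-pixel uncertainty proxy from the same outputs:
#   depth_hat = torch.exp(mu + 0.5 * var)   # mean of the log-normal
#   sigma_hat = var.sqrt()                  # std of log-depth, used to rank pixels
```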
What carries the argument
U-Net models that predict depth distributions while simultaneously estimating uncertainty, applied to event representations such as multi-bin spatio-temporal voxel grids, CSTR, and TORE volumes.
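For the evidential variant, the U-Net presumably ends in a head that emits the four Normal-Inverse-Gamma parameters of deep evidential regression [26]. A minimal sketch of such a head follows; the 1x1 convolution and softplus activations are assumptions, not the paper's stated architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    """Maps decoder features to per-pixel Normal-Inverse-Gamma parameters."""

    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 4, kernel_size=1)

    def forward(self, feats):
        gamma, nu_raw, alpha_raw, beta_raw = self.conv(feats).chunk(4, dim=1)
        nu = F.softplus(nu_raw)               # nu > 0   (virtual observations of the mean)
        alpha = F.softplus(alpha_raw) + 1.0   # alpha > 1 (keeps variances finite)
        beta = F.softplus(beta_raw)           # beta > 0
        return gamma, nu, alpha, beta

# Uncertainties derived from the NIG parameters (Amini et al. [26]):
#   aleatoric = beta / (alpha - 1)
#   epistemic = beta / (nu * (alpha - 1))
```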
If this is right
- Uncertainty estimates can be used to identify and ignore pixels whose depth values are likely incorrect.
- 10 temporal bin voxel grids paired with log-normal uncertainty and 5 temporal bin voxel grids paired with evidential learning achieve the strongest results on standard depth and sparsification metrics.
- Performance remains comparable across the tested event representations once the models are fine-tuned on real data.
- Synthetic pre-training followed by limited real fine-tuning supports deployment in practical event-camera settings.
Where Pith is reading between the lines
- Robotic systems using event cameras could discard uncertain depth values before making navigation decisions.
- Uncertainty maps might guide active sensing strategies that allocate more events to uncertain regions.
- The same uncertainty machinery could be tested on longer temporal sequences to check consistency over time.
Load-bearing premise
Fine-tuning on a limited set of real sequences after synthetic pre-training produces depth and uncertainty estimates that generalize to new real-world environments without significant domain shift or overfitting.
What would settle it
An evaluation on held-out real event sequences in which removing high-uncertainty pixels fails to reduce the sparsification error would show that the uncertainty estimates do not reliably indicate accurate depth.
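That test rests on the sparsification curve underlying AUSE. A minimal sketch of how the curve and its area are typically computed; the removal fractions, oracle baseline, and flattened-array inputs are generic choices, not necessarily the authors' exact protocol.

```python
import numpy as np

def sparsification_curve(errors, uncertainties, steps=100):
    """Mean error of the pixels kept after removing the most-uncertain ones."""
    order = np.argsort(-uncertainties)          # most uncertain first
    errors_sorted = errors[order]
    n = len(errors)
    fractions = np.linspace(0.0, 0.99, steps)
    return np.array([errors_sorted[int(f * n):].mean() for f in fractions])

def ause(errors, uncertainties):
    """Area between the uncertainty-ranked curve and the oracle (error-ranked) curve."""
    curve_unc = sparsification_curve(errors, uncertainties)
    curve_oracle = sparsification_curve(errors, errors)   # oracle: remove worst pixels first
    return np.trapz(curve_unc - curve_oracle, dx=1.0 / len(curve_unc))

# errors and uncertainties are flattened per-pixel arrays over the valid pixels.
# If discarding high-uncertainty pixels does not lower the remaining error, the
# uncertainty curve stays far above the oracle and AUSE stays large.
```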
Original abstract
Event cameras offer distinct advantages over conventional frame-based sensors, including microsecond-level temporal resolution, high dynamic range, and low bandwidth. In this paper, we predict per-pixel depth distributions from monocular event streams using deep neural networks. We estimate uncertainty using Gaussian, log-normal, and evidential learning frameworks. We compare six event representations: spatio-temporal voxel grids with 1, 5, 10, and 20 temporal bins, the Compact Spatio-Temporal Representation (CSTR), and Time-Ordered Recent Event (TORE) volumes. Our U-Net-based models are trained on synthetic data and then fine-tuned on real sequences. We evaluate performance using absolute relative error, root mean squared error, and the area under the sparsification error. Quantitative results show that the representations perform similarly, while 10 bin log-normal and 5 bin evidential learning perform best across metrics. Our experiments demonstrate that uncertainty estimation can be successfully integrated into event-based monocular depth estimation, and be used to indicate pixels with reliable depth.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes U-Net models for per-pixel depth distribution prediction from monocular event camera streams, incorporating uncertainty via Gaussian, log-normal, and evidential frameworks. It evaluates six event representations (voxel grids with varying temporal bins, CSTR, TORE) after synthetic pre-training and real-sequence fine-tuning, reporting best performance for 10-bin log-normal and 5-bin evidential variants on absolute relative error, RMSE, and AUSE metrics. The central claim is that uncertainty estimation integrates successfully into event-based depth estimation and can indicate pixels with reliable depth.
Significance. If the empirical results and generalization claims hold, the work provides a useful empirical benchmark for uncertainty-aware neuromorphic depth estimation, highlighting practical combinations of representations and uncertainty models that could aid robust perception in high-dynamic-range or low-light scenarios. The AUSE-based evaluation of uncertainty calibration is a positive aspect, as is the systematic comparison across representations.
major comments (3)
- [Abstract] Abstract: The claim that uncertainty 'can be used to indicate pixels with reliable depth' is load-bearing but unsupported by evidence of generalization. The abstract reports only aggregate AUSE scores without describing the real-data split, number of sequences, scene diversity, or any held-out real test set drawn from a different environment or sensor; AUSE on the fine-tuning distribution alone cannot substantiate the generalization assumption.
- [Experiments] Experiments section (inferred from quantitative results description): No details are provided on statistical testing, error bars, or variance across multiple training runs for the reported metrics (e.g., the best 10-bin log-normal and 5-bin evidential results). This weakens confidence in the ranking of representations and uncertainty frameworks.
- [Abstract] Abstract and evaluation: Baseline comparisons are limited to internal variants; the manuscript does not report comparisons against prior event-based depth methods or frame-based equivalents on the same real sequences, making it difficult to assess the absolute advance in depth accuracy or uncertainty quality.
minor comments (2)
- [Methods] Clarify the exact definitions and hyperparameters of the six event representations (e.g., how temporal bin counts are chosen and normalized) in the methods section to improve reproducibility.
- [Abstract] The abstract states 'quantitative results show that the representations perform similarly' yet highlights specific winners; add a table or figure with all metric values for all combinations to support this statement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the AUSE evaluation and systematic comparisons. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The claim that uncertainty 'can be used to indicate pixels with reliable depth' is load-bearing but unsupported by evidence of generalization. The abstract reports only aggregate AUSE scores without describing the real-data split, number of sequences, scene diversity, or any held-out real test set drawn from a different environment or sensor; AUSE on the fine-tuning distribution alone cannot substantiate the generalization assumption.
Authors: We appreciate the referee's point on clarifying the scope of our claims. The evaluation uses held-out portions of the real sequences after synthetic pre-training and fine-tuning, demonstrating that uncertainty correlates with depth error on unseen real data from the same sensor and environment distribution. In the revised manuscript, we will expand both the abstract and experiments section to explicitly describe the real-data splits (train/validation/test partitioning), number of sequences, and scene diversity. We will also qualify the generalization statement to reflect that the results support reliable-depth indication within the evaluated real sequences, while noting cross-environment or cross-sensor generalization as future work. These additions will better ground the abstract claim. revision: yes
- Referee: [Experiments] No details are provided on statistical testing, error bars, or variance across multiple training runs for the reported metrics (e.g., the best 10-bin log-normal and 5-bin evidential results). This weakens confidence in the ranking of representations and uncertainty frameworks.
Authors: We agree that reporting variability across runs would increase confidence in the metric rankings. In the revised version, we will add error bars (standard deviation across 3-5 independent training runs with different random seeds) for the key absolute relative error, RMSE, and AUSE results. We will also note any statistically significant differences between the top variants where appropriate. This revision directly addresses the concern without altering the core findings. revision: yes
- Referee: [Abstract] Baseline comparisons are limited to internal variants; the manuscript does not report comparisons against prior event-based depth methods or frame-based equivalents on the same real sequences, making it difficult to assess the absolute advance in depth accuracy or uncertainty quality.
Authors: The manuscript's primary aim is a controlled, systematic benchmark of event representations and uncertainty models under a unified training protocol rather than establishing new state-of-the-art depth accuracy. We will revise the related-work and experiments sections to include a discussion of prior event-based depth methods, referencing their reported metrics on comparable datasets, and to contextualize our depth accuracy and uncertainty quality relative to those works. Direct comparisons on identical real sequences are limited by differences in data splits and protocols across the literature; we will explicitly acknowledge this constraint while emphasizing the value of the internal ablation for isolating representation and uncertainty effects. revision: partial
Circularity Check
No circularity: purely empirical training and metric evaluation on held-out data
Full rationale
The paper trains U-Net models on synthetic event data, fine-tunes on real sequences, and reports aggregate metrics (Abs Rel, RMSE, AUSE) for depth and uncertainty. No equations, predictions, or uniqueness claims are present that reduce by construction to fitted parameters, self-citations, or ansatzes defined in terms of the target outputs. The pipeline is self-contained empirical ML work with standard held-out evaluation; the reader-assigned score of 1.0 is consistent with this finding.
Axiom & Free-Parameter Ledger
free parameters (2)
- Temporal bin counts ∈ {1, 5, 10, 20}
- U-Net weights
axioms (1)
- Domain assumption: Event streams can be losslessly or near-losslessly converted into fixed-size spatio-temporal volumes suitable for convolutional networks.
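This assumption is usually realised by binning events into a fixed-size volume. A minimal sketch of a spatio-temporal voxel grid with bilinear temporal interpolation, one common construction consistent with the representations compared here; bin placement, polarity handling, and normalisation are assumptions rather than the authors' exact recipe.

```python
import numpy as np

def events_to_voxel_grid(xs, ys, ts, ps, num_bins, height, width):
    """Accumulate events into a (num_bins, H, W) volume.

    xs, ys : integer pixel coordinates of the events.
    ts     : timestamps (any unit); normalised internally.
    ps     : polarities, mapped to +1 / -1.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    span = max(ts.max() - ts.min(), 1e-9)
    t_norm = (ts - ts.min()) / span * (num_bins - 1)   # scale time to [0, B-1]
    lower = np.floor(t_norm).astype(int)
    frac = t_norm - lower
    pol = np.where(ps > 0, 1.0, -1.0).astype(np.float32)

    # Bilinear interpolation in time: each event contributes to two adjacent bins.
    np.add.at(grid, (lower, ys, xs), pol * (1.0 - frac))
    upper = np.clip(lower + 1, 0, num_bins - 1)
    np.add.at(grid, (upper, ys, xs), pol * frac)
    return grid
```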
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We estimate uncertainty using Gaussian, log-normal, and evidential learning frameworks... U-Net-based models are trained on synthetic data and then fine-tuned on real sequences... area under the sparsification error"
- IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "spatio-temporal voxel grids with 1, 5, 10, and 20 temporal bins..." (the theorem's 8-tick period is never mentioned in the paper)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, and D. Scaramuzza, "Event-based vision: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 1, pp. 154–180, 2022.
- [2] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis, "Unsupervised event-based learning of optical flow, depth and egomotion," in Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- [3] J. Hidalgo-Carrio, D. Gehrig, and D. Scaramuzza, "Learning monocular dense depth from events," in IEEE International Conference on 3D Vision (3DV), 2020. [Online]. Available: http://rpg.ifi.uzh.ch/docs/3DV20_Hidalgo.pdf
- [4] Z. A. El Shair, A. Hassani, and S. A. Rawashdeh, "CSTR: A compact spatio-temporal representation for event-based vision," IEEE Access, pp. 102899–102916, 2023.
- [5] R. W. Baldwin, R. Liu, M. Almatrafi, V. Asari, and K. Hirakawa, "Time-ordered recent event (TORE) volumes for event cameras," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 2, pp. 2519–2532, 2023.
- [6] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, "Depth anything: Unleashing the power of large-scale unlabeled data," in CVPR, 2024.
- [7] Y. Li, Y. Shen, Z. Huang, S. Chen, W. Bian, X. Shi, F.-Y. Wang, K. Sun, H. Bao, Z. Cui, G. Zhang, and H. Li, "BlinkVision: A benchmark for optical flow, scene flow and point tracking estimation using RGB frames and events," in European Conference on Computer Vision (ECCV), 2024.
- [8] S. Lin, Y. Ma, Z. Guo, and B. Wen, "DVS-Voltmeter: Stochastic process-based event simulator for dynamic vision sensors," in Computer Vision – ECCV 2022, 2022. [Online]. Available: https://doi.org/10.1007/978-3-031-20071-7_34
- [9] A. Z. Zhu, D. Thakur, T. Özaslan, B. Pfrommer, V. Kumar, and K. Daniilidis, "The multivehicle stereo event camera dataset: An event camera dataset for 3D perception," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2032–2039, 2018.
- [10] J. Kogler, C. Sulzbachner, M. Humenberger, and F. Eibensteiner, "Address-event based stereo vision with bio-inspired silicon retina imagers," in Advances in Theory and Applications of Stereo Vision, A. Bhatti, Ed. Rijeka: IntechOpen, 2011, ch. 9. [Online]. Available: https://doi.org/10.5772/12941
- [11] L. A. Camunas-Mesa, T. Serrano-Gotarredona, S. H. Ieng, R. B. Benosman, and B. Linares-Barranco, "On the use of orientation filters for 3D reconstruction in event-driven stereo vision," Frontiers in Neuroscience, vol. 8, 2014. [Online]. Available: https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2014.00048
- [12] P. Rogister, R. Benosman, S.-H. Ieng, P. Lichtsteiner, and T. Delbruck, "Asynchronous event-based binocular stereo matching," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 2, pp. 347–353, 2012.
- [13] H. Rebecq, G. Gallego, E. Mueggler, and D. Scaramuzza, "EMVS: Event-based multi-view stereo—3D reconstruction with an event camera in real-time," Int. J. Comput. Vis., vol. 126, pp. 1394–1414, Dec. 2018.
- [14] G. Gallego, H. Rebecq, and D. Scaramuzza, "A unifying contrast maximization framework for event cameras, with applications to motion, depth, and optical flow estimation," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3867–3876.
- [15] G. Gallego, M. Gehrig, and D. Scaramuzza, "Focus is all you need: Loss functions for event-based vision," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/CVPR.2019.01256
- [16] S. Shiba, Y. Aoki, and G. Gallego, "Secrets of event-based optical flow," in European Conference on Computer Vision (ECCV), 2022, pp. 628–645.
- [17] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "DTAM: Dense tracking and mapping in real-time," in 2011 International Conference on Computer Vision, 2011, pp. 2320–2327.
- [18] Y. Zhou, G. Gallego, H. Rebecq, L. Kneip, H. Li, and D. Scaramuzza, "Semi-dense 3D reconstruction with a stereo event camera," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 235–251.
- [19] N. Cai and P. Bideau, "Active event alignment for monocular distance estimation," in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 2464–2473.
- [20] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing, 2015.
- [21] D. Gehrig, M. Rüegg, M. Gehrig, J. Hidalgo-Carrio, and D. Scaramuzza, "Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction," IEEE Robotics and Automation Letters (RA-L), 2021. [Online]. Available: http://rpg.ifi.uzh.ch/docs/RAL21_Gehrig.pdf
- [22] H. Meng, C. Zhong, S. Tang, L. JunJia, W. Lin, Z. Bing, Y. Chang, G. Chen, and A. Knoll, "Learning monocular depth from events via egomotion compensation," arXiv preprint arXiv:2412.19067, 2024.
- [23] L. Bartolomei, E. Mannocci, F. Tosi, M. Poggi, and S. Mattoccia, "Depth AnyEvent: A cross-modal distillation paradigm for event-based monocular depth estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 19669–19678.
- [24] X. Yin, H. Shi, J. Chen, Z. Wang, Y. Ye, K. Yang, and K. Wang, "Exploring event-based human pose estimation with 3D event representations," Computer Vision and Image Understanding, p. 104189, 2024.
- [25] M. Sensoy, L. Kaplan, and M. Kandemir, "Evidential deep learning to quantify classification uncertainty," Advances in Neural Information Processing Systems, vol. 31, 2018.
- [26] A. Amini, W. Schwarting, A. Soleimany, and D. Rus, "Deep evidential regression," Advances in Neural Information Processing Systems, vol. 33, pp. 14927–14937, 2020.
- [27] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015. [Online]. Available: https://arxiv.org/abs/1412.6980
- [28] R. Liu, R. W. Baldwin, V. Asari, and K. Hirakawa, "TORE-based disparity estimation in stereo event-only vision," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021. CVPR 2021 Workshop on Event-Based Vision, DSEC Competition Submission 1. [Online]. Available: https://dsec.ifi.uzh.ch/wp-content/uploa...
- [29] S. Kristoffersson Lind, Z. Xiong, P.-E. Forssén, and V. Krüger, "Uncertainty quantification metrics for deep regression," Pattern Recognition Letters, vol. 186, pp. 91–97, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167865524002733
- [30] X. Yin, H. Shi, J. Chen, Z. Wang, Y. Ye, K. Yang, and K. Wang, "Exploring event-based human pose estimation with 3D event representations," Computer Vision and Image Understanding, 2023.
- [31] G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, "Transformer-based attention networks for continuous pixel-wise prediction," in ICCV, 2021.