SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams
Pith reviewed 2026-05-19 14:06 UTC · model grok-4.3
The pith
SpikeStereoNet estimates stereo depth directly from raw spike streams by fusing viewpoints and refining via recurrent spiking networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpikeStereoNet is the first framework to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module. It outperforms existing methods on both a new large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations, particularly by capturing subtle edges and intensity shifts in textureless surfaces and extreme lighting conditions, while maintaining high accuracy even with substantially reduced training data.
What carries the argument
The recurrent spiking neural network (RSNN) update module, which fuses stereo spike streams and iteratively refines depth estimates without frame conversion.
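To make "recurrent spiking update" concrete, here is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron whose own previous spike feeds back into its input. It is a generic textbook model, not the paper's RSNN module: the function names, the decay and threshold values, and the weights `w_in` and `w_rec` are all illustrative assumptions.

```python
from typing import List, Tuple


def lif_step(v: float, input_current: float, decay: float = 0.9,
             threshold: float = 1.0) -> Tuple[float, int]:
    """One discrete-time leaky integrate-and-fire update (hypothetical helper).

    Returns the new membrane potential and a binary spike (0 or 1).
    """
    v = decay * v + input_current        # leak, then integrate the input
    spike = 1 if v >= threshold else 0   # fire once the threshold is reached
    if spike:
        v = 0.0                          # hard reset after a spike
    return v, spike


def run_recurrent_lif(inputs: List[float], w_in: float = 0.6,
                      w_rec: float = 0.3) -> List[int]:
    """Drive one LIF neuron with an input train plus its own previous spike."""
    v, prev_spike, out = 0.0, 0, []
    for x in inputs:
        # Feedforward drive plus a recurrent term from the last output spike.
        current = w_in * x + w_rec * prev_spike
        v, prev_spike = lif_step(v, current)
        out.append(prev_spike)
    return out


spikes = run_recurrent_lif([1, 1, 0, 1, 1, 1])
print(spikes)
```

The state carried between steps (membrane potential and last spike) is what lets such a unit refine an estimate over several iterations instead of producing it in one pass.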
If this is right
- Superior performance on synthetic and real-world spike datasets in regions with textureless surfaces or extreme lighting.
- High accuracy maintained even when training data is substantially reduced.
- New benchmarks established via large-scale synthetic spike streams and real-world stereo spike data with dense depth labels.
- Direct use of asynchronous spike events enables handling of rapidly changing scenes where frame cameras struggle.
Where Pith is reading between the lines
- Direct spike processing could extend to other high-speed vision tasks such as motion estimation or object tracking in robotics.
- The data-efficiency property may reduce the need for large labeled datasets in event-based stereo systems.
- Integration with existing spike-camera hardware could support low-power, real-time depth in autonomous navigation.
Load-bearing premise
Raw spike streams from stereo viewpoints contain enough information for accurate depth estimation when processed directly by the recurrent spiking update module.
What would settle it
If a conventional frame-based stereo method, after converting spike streams to intensity frames, matches or exceeds SpikeStereoNet accuracy on the introduced real-world dataset, the advantage of direct spike processing would be called into question.
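The frame-conversion baseline in this test could be built in several ways. The sketch below shows one simple option, assuming a dense binary spike stream indexed as `[t][row][col]`: average the spikes over a time window to obtain a firing-rate image. The function name `spikes_to_frame` and the data layout are assumptions made for illustration, not something specified in the paper.

```python
from typing import List


def spikes_to_frame(spike_stream: List[List[List[int]]],
                    start: int, window: int) -> List[List[float]]:
    """Average binary spikes over `window` time steps to get a rate image.

    spike_stream is indexed [t][row][col] with values 0 or 1 (assumed layout).
    """
    if window <= 0 or start < 0 or start + window > len(spike_stream):
        raise ValueError("window must lie inside the spike stream")
    rows, cols = len(spike_stream[0]), len(spike_stream[0][0])
    counts = [[0] * cols for _ in range(rows)]
    for t in range(start, start + window):
        for r in range(rows):
            for c in range(cols):
                counts[r][c] += spike_stream[t][r][c]
    # Firing rate in [0, 1]: pixels that spike more often map to brighter values.
    return [[n / window for n in row] for row in counts]


# Toy stream: 4 time steps, 1 x 2 pixels. Pixel 0 fires every step, pixel 1 once.
stream = [[[1, 0]], [[1, 1]], [[1, 0]], [[1, 0]]]
frame = spikes_to_frame(stream, start=0, window=4)
print(frame)
```

The window length trades temporal resolution for noise: a longer window gives a smoother image but blurs fast motion, which is exactly the loss the direct-spike approach claims to avoid.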
Original abstract
Conventional frame-based cameras often struggle with stereo depth estimation in rapidly changing scenes. In contrast, bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality. However, existing methods lack specialized stereo algorithms and benchmarks tailored to the spike data. To address this gap, we propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module. To benchmark our approach, we introduce a large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations. SpikeStereoNet outperforms existing methods on both datasets by leveraging spike streams' ability to capture subtle edges and intensity shifts in challenging regions such as textureless surfaces and extreme lighting conditions. Furthermore, our framework exhibits strong data efficiency, maintaining high accuracy even with substantially reduced training data. The source code and datasets will be publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SpikeStereoNet, a brain-inspired framework for stereo depth estimation directly from raw spike streams. It fuses spike data from two viewpoints and iteratively refines depth via a recurrent spiking neural network (RSNN) update module. New large-scale synthetic and real-world stereo spike datasets with dense depth annotations are introduced, with claims of outperformance over existing methods and strong data efficiency even with reduced training data.
Significance. If the empirical claims hold, the work would be significant for event-based and spike-based vision by filling a gap in specialized stereo algorithms for microsecond-resolution spike cameras. The public release of datasets and code supports reproducibility, and the data-efficiency results could have practical value in data-scarce regimes. The brain-inspired RSNN refinement approach offers a novel direction distinct from frame-based stereo pipelines.
major comments (2)
- §3.2 (Input Fusion and RSNN Module): The central claim of direct estimation from raw spike streams without frame-like representations or hand-crafted features is load-bearing but not fully secured. The architecture description does not explicitly rule out temporal binning, rate coding, or accumulation steps before the RSNN update module receives input; any such preprocessing would contradict the 'raw spike streams' and 'no explicit conversion' assertions.
- §4.1–4.2 (Experiments): Quantitative results (EPE, D1, etc.) are presented for synthetic and real-world datasets, but the data-efficiency experiments lack ablations isolating the contribution of the RSNN iterative refinement versus the spike input alone, weakening attribution of the reported gains.
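For readers unfamiliar with the EPE and D1 metrics named in the comment above, the following is a minimal sketch of how they are commonly computed on disparity values. It assumes the KITTI-style D1 convention (a pixel is an outlier when its error exceeds both 3 px and 5% of the true disparity) and treats non-positive ground truth as invalid; the paper's exact thresholds and masking may differ, and the function name `epe_and_d1` is hypothetical.

```python
from typing import List, Tuple


def epe_and_d1(pred: List[float], gt: List[float],
               abs_thresh: float = 3.0,
               rel_thresh: float = 0.05) -> Tuple[float, float]:
    """Return (EPE in pixels, D1 in percent) over pixels with valid ground truth.

    Assumption: ground-truth disparities <= 0 mark pixels without annotation.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth pixels")
    errors = [abs(p - g) for p, g in valid]
    epe = sum(errors) / len(errors)  # mean absolute disparity error
    # Outlier: error above the absolute AND the relative threshold.
    outliers = sum(1 for e, (_, g) in zip(errors, valid)
                   if e > abs_thresh and e > rel_thresh * g)
    d1 = 100.0 * outliers / len(valid)
    return epe, d1


# The last pixel has no ground truth (0) and is ignored.
epe, d1 = epe_and_d1(pred=[10.0, 20.0, 30.0, 50.0], gt=[10.0, 21.0, 34.0, 0.0])
print(epe, d1)
```

EPE averages all errors, so it is sensitive to a few very large mistakes, while D1 only counts how many pixels are badly wrong; reporting both gives a fuller picture.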
minor comments (2)
- §2 (Related Work): A few recent spike-based stereo or depth works are omitted from the literature review.
- Figure 4 (Qualitative Results): Depth map visualizations would benefit from explicit scale bars or error heatmaps to aid interpretation of challenging regions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's significance for event-based stereo vision. We address each major comment in detail below and have revised the manuscript accordingly to improve clarity and strengthen the experimental analysis.
Point-by-point responses
- Referee: §3.2 (Input Fusion and RSNN Module): The central claim of direct estimation from raw spike streams without frame-like representations or hand-crafted features is load-bearing but not fully secured. The architecture description does not explicitly rule out temporal binning, rate coding, or accumulation steps before the RSNN update module receives input; any such preprocessing would contradict the 'raw spike streams' and 'no explicit conversion' assertions.
  Authors: We thank the referee for this important observation. The manuscript's Section 3.2 describes the input fusion as directly concatenating the asynchronous spike events from the stereo spike cameras, with each event carrying its native microsecond timestamp and no conversion to frames or rates. The RSNN update module receives these raw binary spike trains and processes them via its recurrent spiking dynamics without intermediate binning or accumulation. To remove any ambiguity, we have added explicit language in the revised Section 3.2 stating that the pipeline contains no temporal binning, rate coding, or hand-crafted feature extraction prior to the RSNN, thereby reinforcing the 'raw spike streams' and 'no explicit conversion' claims.
  Revision: yes
- Referee: §4.1–4.2 (Experiments): Quantitative results (EPE, D1, etc.) are presented for synthetic and real-world datasets, but the data-efficiency experiments lack ablations isolating the contribution of the RSNN iterative refinement versus the spike input alone, weakening attribution of the reported gains.
  Authors: We agree that isolating the RSNN's iterative refinement from the raw spike input would strengthen attribution of the data-efficiency gains. We have added new ablation experiments in the revised Section 4.2 that compare the full SpikeStereoNet against a non-recurrent variant using only the initial spike fusion (no RSNN updates) across training data fractions from 10% to 100%. These results, reported with the same EPE and D1 metrics, show that the iterative RSNN refinement accounts for the majority of the performance retention under reduced data, directly addressing the concern.
  Revision: yes
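A data-fraction ablation of the kind described in this response is easier to interpret when the smaller training sets are nested inside the larger ones, so that differences come from the amount of data and not from which samples were drawn. The sketch below shows one way to build such subsets; the function name `nested_subsets`, the fixed seed, and the use of integer sample IDs are assumptions for illustration, not the authors' protocol.

```python
import random
from typing import Dict, List, Sequence


def nested_subsets(sample_ids: Sequence[int], fractions: Sequence[float],
                   seed: int = 0) -> Dict[float, List[int]]:
    """Build training subsets where each smaller fraction is a prefix of the larger."""
    order = list(sample_ids)
    random.Random(seed).shuffle(order)  # one fixed shuffle shared by all fractions
    subsets: Dict[float, List[int]] = {}
    for frac in fractions:
        if not 0.0 < frac <= 1.0:
            raise ValueError("fractions must be in (0, 1]")
        k = max(1, round(frac * len(order)))  # keep at least one sample
        subsets[frac] = order[:k]
    return subsets


subsets = nested_subsets(range(100), fractions=[0.1, 0.5, 1.0])
for frac, ids in subsets.items():
    print(frac, len(ids))
```

Both the full model and the non-recurrent variant would then be trained on the same subset at each fraction, so their EPE and D1 curves are directly comparable.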
Circularity Check
No circularity: SpikeStereoNet is a new architecture with empirical validation on introduced datasets
Full rationale
The paper introduces SpikeStereoNet as a novel brain-inspired RSNN-based framework for direct stereo depth estimation from raw spike streams, along with new synthetic and real-world datasets. Claims rest on architectural design choices and performance comparisons rather than any self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central result to prior inputs. No derivation chain is presented that equates outputs to inputs by construction; the work is self-contained against external benchmarks via reported experiments.
Axiom & Free-Parameter Ledger
free parameters (1)
- RSNN trainable weights
axioms (1)
- domain assumption: Raw spike streams from two viewpoints contain sufficient information for direct stereo depth estimation without frame conversion.
invented entities (1)
- RSNN update module: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: $\alpha_t = \sigma(\mathrm{Conv}([s_{t-1}, x_t], W_\alpha) + c_\alpha)$, … $s_t = f(s_{t-1}, s_t^{(l-1)}, \alpha_t, \beta_t, \gamma_t)$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.