SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams

Guozhang Chen; Hao Dong; Hao Tang; Jiyao Zhang; Rui Zhao; Tiejun Huang; Tong Wu; Yihao Li; Zhaofei Yu; Zhuoheng Gao

arxiv: 2505.19487 · v3 · submitted 2025-05-26 · 💻 cs.CV

SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams

Zhuoheng Gao , Yihao Li , Jiyao Zhang , Rui Zhao , Tong Wu , Hao Tang , Zhaofei Yu , Hao Dong

show 2 more authors

Guozhang Chen Tiejun Huang

This is my paper

Pith reviewed 2026-05-19 14:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords stereo depth estimationspike camerasspiking neural networksevent-based visionbrain-inspired computingdepth from events

0 comments

The pith

SpikeStereoNet estimates stereo depth directly from raw spike streams by fusing viewpoints and refining via recurrent spiking networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SpikeStereoNet as the first specialized framework for stereo depth estimation that operates straight on asynchronous spike streams from two viewpoints instead of converting them to frames. It fuses the streams and applies an iterative refinement step through a recurrent spiking neural network module to improve accuracy in difficult regions. The approach is tested on new synthetic and real-world spike datasets with dense ground truth, showing gains over prior methods especially under textureless surfaces, extreme lighting, and with far less training data. A reader would care because spike cameras capture microsecond changes that conventional cameras miss, potentially enabling reliable depth sensing in fast or low-light settings.

Core claim

SpikeStereoNet is the first framework to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module. It outperforms existing methods on both a new large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations, particularly by capturing subtle edges and intensity shifts in textureless surfaces and extreme lighting conditions, while maintaining high accuracy even with substantially reduced training data.

What carries the argument

The recurrent spiking neural network (RSNN) update module, which fuses stereo spike streams and iteratively refines depth estimates without frame conversion.

If this is right

Superior performance on synthetic and real-world spike datasets in regions with textureless surfaces or extreme lighting.
High accuracy maintained even when training data is substantially reduced.
New benchmarks established via large-scale synthetic spike streams and real-world stereo spike data with dense depth labels.
Direct use of asynchronous spike events enables handling of rapidly changing scenes where frame cameras struggle.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Direct spike processing could extend to other high-speed vision tasks such as motion estimation or object tracking in robotics.
The data-efficiency property may reduce the need for large labeled datasets in event-based stereo systems.
Integration with existing spike-camera hardware could support low-power, real-time depth in autonomous navigation.

Load-bearing premise

Raw spike streams from stereo viewpoints contain enough information for accurate depth estimation when processed directly by the recurrent spiking update module.

What would settle it

If a conventional frame-based stereo method, after converting spike streams to intensity frames, matches or exceeds SpikeStereoNet accuracy on the introduced real-world dataset, the advantage of direct spike processing would be called into question.

read the original abstract

Conventional frame-based cameras often struggle with stereo depth estimation in rapidly changing scenes. In contrast, bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality. However, existing methods lack specialized stereo algorithms and benchmarks tailored to the spike data. To address this gap, we propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module. To benchmark our approach, we introduce a large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations. SpikeStereoNet outperforms existing methods on both datasets by leveraging spike streams' ability to capture subtle edges and intensity shifts in challenging regions such as textureless surfaces and extreme lighting conditions. Furthermore, our framework exhibits strong data efficiency, maintaining high accuracy even with substantially reduced training data. The source code and datasets will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SpikeStereoNet introduces the first direct stereo depth method from raw spike streams plus new datasets, but the abstract leaves the performance claims unverified.

read the letter

SpikeStereoNet is the first paper to tackle stereo depth estimation directly from raw spike streams using a recurrent spiking neural network for fusion and refinement, and it comes with new synthetic and real-world datasets that include dense annotations. This fills a clear gap because spike cameras offer microsecond resolution that frame cameras lack in fast-changing or extreme lighting scenes. The approach fuses the two viewpoints' spikes and uses the RSNN to iteratively update the depth map, which aligns with the asynchronous nature of the data. Releasing the code and datasets is a plus, as it gives the community something concrete to test against. The work does a good job highlighting where spike data can help with textureless areas and intensity shifts that trip up standard methods. The data efficiency claim is interesting if it holds, as training on less data would be practical. That said, the abstract provides no specific metrics or comparisons, so the outperformance and efficiency statements are hard to evaluate without the full results section. The stress-test concern about whether the input is truly raw without any binning or conversion is important to verify in the architecture description. If the front-end keeps the spikes as events or direct inputs to the spiking net, the claim stands; if not, it needs rephrasing. From what I can see, the paper engages honestly with the literature on spike vision and doesn't seem to have circular logic. This paper is for people in computer vision working on neuromorphic or event-based sensors, particularly those interested in depth for dynamic environments. A reader focused on robotics or autonomous systems in challenging conditions would find the datasets and baseline useful. I think it deserves a serious referee. The novelty in the datasets and the specialized framework make it worth reviewing, even if revisions are needed for stronger evidence.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SpikeStereoNet, a brain-inspired framework for stereo depth estimation directly from raw spike streams. It fuses spike data from two viewpoints and iteratively refines depth via a recurrent spiking neural network (RSNN) update module. New large-scale synthetic and real-world stereo spike datasets with dense depth annotations are introduced, with claims of outperformance over existing methods and strong data efficiency even with reduced training data.

Significance. If the empirical claims hold, the work would be significant for event-based and spike-based vision by filling a gap in specialized stereo algorithms for microsecond-resolution spike cameras. The public release of datasets and code supports reproducibility, and the data-efficiency results could have practical value in data-scarce regimes. The brain-inspired RSNN refinement approach offers a novel direction distinct from frame-based stereo pipelines.

major comments (2)

[§3.2] §3.2 (Input Fusion and RSNN Module): The central claim of direct estimation from raw spike streams without frame-like representations or hand-crafted features is load-bearing but not fully secured. The architecture description does not explicitly rule out temporal binning, rate coding, or accumulation steps before the RSNN update module receives input; any such preprocessing would contradict the 'raw spike streams' and 'no explicit conversion' assertions.
[§4.1–4.2] §4.1–4.2 (Experiments): Quantitative results (EPE, D1, etc.) are presented for synthetic and real-world datasets, but the data-efficiency experiments lack ablations isolating the contribution of the RSNN iterative refinement versus the spike input alone, weakening attribution of the reported gains.

minor comments (2)

[§2] §2 (Related Work): A few recent spike-based stereo or depth works are omitted from the literature review.
[Figure 4] Figure 4 (Qualitative Results): Depth map visualizations would benefit from explicit scale bars or error heatmaps to aid interpretation of challenging regions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance for event-based stereo vision. We address each major comment in detail below and have revised the manuscript accordingly to improve clarity and strengthen the experimental analysis.

read point-by-point responses

Referee: [§3.2] §3.2 (Input Fusion and RSNN Module): The central claim of direct estimation from raw spike streams without frame-like representations or hand-crafted features is load-bearing but not fully secured. The architecture description does not explicitly rule out temporal binning, rate coding, or accumulation steps before the RSNN update module receives input; any such preprocessing would contradict the 'raw spike streams' and 'no explicit conversion' assertions.

Authors: We thank the referee for this important observation. The manuscript's Section 3.2 describes the input fusion as directly concatenating the asynchronous spike events from the stereo spike cameras, with each event carrying its native microsecond timestamp and no conversion to frames or rates. The RSNN update module receives these raw binary spike trains and processes them via its recurrent spiking dynamics without intermediate binning or accumulation. To remove any ambiguity, we have added explicit language in the revised Section 3.2 stating that the pipeline contains no temporal binning, rate coding, or hand-crafted feature extraction prior to the RSNN, thereby reinforcing the 'raw spike streams' and 'no explicit conversion' claims. revision: yes
Referee: [§4.1–4.2] §4.1–4.2 (Experiments): Quantitative results (EPE, D1, etc.) are presented for synthetic and real-world datasets, but the data-efficiency experiments lack ablations isolating the contribution of the RSNN iterative refinement versus the spike input alone, weakening attribution of the reported gains.

Authors: We agree that isolating the RSNN's iterative refinement from the raw spike input would strengthen attribution of the data-efficiency gains. We have added new ablation experiments in the revised Section 4.2 that compare the full SpikeStereoNet against a non-recurrent variant using only the initial spike fusion (no RSNN updates) across training data fractions from 10% to 100%. These results, reported with the same EPE and D1 metrics, show that the iterative RSNN refinement accounts for the majority of the performance retention under reduced data, directly addressing the concern. revision: yes

Circularity Check

0 steps flagged

No circularity: SpikeStereoNet is a new architecture with empirical validation on introduced datasets

full rationale

The paper introduces SpikeStereoNet as a novel brain-inspired RSNN-based framework for direct stereo depth estimation from raw spike streams, along with new synthetic and real-world datasets. Claims rest on architectural design choices and performance comparisons rather than any self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central result to prior inputs. No derivation chain is presented that equates outputs to inputs by construction; the work is self-contained against external benchmarks via reported experiments.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the premise that raw asynchronous spike events carry usable stereo disparity information and that a recurrent spiking network can iteratively extract depth without intermediate frame reconstruction. No explicit free parameters are named in the abstract, but the RSNN module implies trainable weights fitted to the new datasets.

free parameters (1)

RSNN trainable weights
The recurrent spiking neural network update module contains parameters that are learned from the introduced training data.

axioms (1)

domain assumption Raw spike streams from two viewpoints contain sufficient information for direct stereo depth estimation without frame conversion.
The framework is built on the assumption that asynchronous spike events alone are adequate input for the RSNN.

invented entities (1)

RSNN update module no independent evidence
purpose: Iteratively refines depth estimates from fused spike streams
A new recurrent spiking component introduced specifically for this stereo task.

pith-pipeline@v0.9.0 · 5735 in / 1442 out tokens · 60404 ms · 2026-05-19T14:06:33.649394+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

αt=σ(Conv([st−1,xt],Wα)+cα), … st=f(st−1,s(l−1)t,αt,βt,γt)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.