NeuroRing: Scaling Spiking Neural Networks via Multi-FPGA Bidirectional Ring Topologies and Stream-Dataflow Architectures

Artur Podobas; Muhammad Ihsan Al Hafiz

arxiv: 2604.28059 · v2 · pith:HNQL6W6Snew · submitted 2026-04-30 · 💻 cs.AR · cs.DC · cs.NE

Pith reviewed 2026-05-07 06:13 UTC · model grok-4.3

classification 💻 cs.AR · cs.DC · cs.NE

keywords spiking neural networks · FPGA acceleration · ring topology · stream dataflow · cortical microcircuit · real-time factor · energy efficiency · scalability

The pith

A bidirectional ring topology and stream-dataflow architecture on FPGAs enables scalable faster-than-real-time execution of large spiking neural networks while preserving activity statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NeuroRing as a modular accelerator for spiking neural networks that uses a bidirectional ring topology to connect multiple FPGAs and a stream-dataflow architecture to process sparse spike events. This design is implemented through high-level synthesis on programmable FPGAs and supports both single- and multi-device setups while remaining compatible with standard simulation tools. Evaluation on the cortical microcircuit benchmark shows that the system maintains the key activity statistics of the reference model and reaches a real-time factor of 0.83 on two FPGAs, indicating faster-than-real-time performance. It also exhibits meaningful strong and weak scaling along with competitive energy efficiency. These outcomes indicate that reconfigurable hardware can support large-scale event-driven simulations without fixed-function custom chips.

Core claim

NeuroRing implements a stream-dataflow architecture over a bidirectional ring topology on programmable FPGAs to accelerate spiking neural networks. The approach supports modular single- and multi-FPGA deployment and preserves the activity statistics of the reference cortical microcircuit model while achieving a real-time factor of 0.83 on two devices, along with strong and weak scaling and competitive energy efficiency.
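For readers unfamiliar with the metric, the real-time factor is conventionally the ratio of wall-clock time to simulated biological time, so a value below 1 means the simulation runs faster than real time. The sketch below shows only that arithmetic, assuming the conventional definition. The function names and the 10 s example duration are illustrative and are not taken from the paper.

```python
def real_time_factor(wall_clock_s: float, simulated_s: float) -> float:
    """Ratio of wall-clock time to simulated (biological) time."""
    if simulated_s <= 0:
        raise ValueError("simulated_s must be positive")
    return wall_clock_s / simulated_s


def is_faster_than_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means the model advances faster than biology."""
    return rtf < 1.0


# Hypothetical run: 10 s of model time completed in 8.3 s of wall-clock time.
rtf = real_time_factor(8.3, 10.0)
print(f"RTF = {rtf:.2f}, faster than real time: {is_faster_than_real_time(rtf)}")
```

Under this definition, an RTF of 0.83 corresponds to roughly 17% less wall-clock time than the simulated duration.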

What carries the argument

Bidirectional ring topology for inter-device communication combined with stream-dataflow architecture for efficient handling of sparse spike events.
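As a rough illustration of why a bidirectional ring helps, the sketch below picks the shorter of the two directions between a source and a destination core. It assumes shortest-path routing over cores numbered around the ring. The paper's actual routing policy is not described on this page, so `ring_route` and its direction labels are hypothetical.

```python
from typing import Tuple


def ring_route(src: int, dst: int, n_cores: int) -> Tuple[str, int]:
    """Return (direction, hops) for the shorter path on a bidirectional ring."""
    if not (0 <= src < n_cores and 0 <= dst < n_cores):
        raise ValueError("core index out of range")
    right_hops = (dst - src) % n_cores  # travelling toward higher indices
    left_hops = (src - dst) % n_cores   # travelling toward lower indices
    if right_hops <= left_hops:
        return "right", right_hops
    return "left", left_hops


# On a hypothetical 20-core ring, core 0 reaches core 17 in 3 hops going left
# instead of 17 hops going right.
print(ring_route(0, 17, 20))
```

With both directions available, no destination is more than half the ring away, whereas a unidirectional ring can require almost a full traversal.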

If this is right

The modular design allows adding more FPGAs to handle larger networks while keeping the same performance characteristics.
Integration with existing simulation tools enables direct use in neuroscience workflows without major code changes.
The achieved real-time factor of 0.83 on two devices demonstrates that reconfigurable hardware can run a full-scale model faster than biological real time.
Competitive energy efficiency positions the approach as viable for sustained large-scale event-driven computations.
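Energy efficiency for SNN platforms is commonly reported as energy per synaptic event, which is average power multiplied by wall-clock time and divided by the number of synaptic events processed. The sketch below shows that calculation under this common definition. The event count and runtime are hypothetical placeholders, not measurements from the paper.

```python
def energy_per_synaptic_event_nj(power_w: float, wall_clock_s: float,
                                 synaptic_events: float) -> float:
    """Energy per synaptic event in nanojoules: (P * t / N) * 1e9."""
    if synaptic_events <= 0:
        raise ValueError("synaptic_events must be positive")
    return power_w * wall_clock_s / synaptic_events * 1e9


# Hypothetical run: 70 W average power, 8.3 s wall-clock, 1e10 synaptic events.
nj = energy_per_synaptic_event_nj(70.0, 8.3, 1e10)
print(f"{nj:.1f} nJ per synaptic event")
```

Because power and runtime enter as a product, adding devices lowers energy per event only if the runtime falls faster than the power rises.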

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The ring topology's benefits for sparse communication could extend to other event-driven hardware if the dataflow principles prove platform-independent.
Testing with varied sparsity patterns would reveal whether the architecture needs tuning for different application domains beyond neural circuits.
Hybrid systems combining this FPGA setup with conventional processors might handle mixed workloads more flexibly than either alone.

Load-bearing premise

The bidirectional ring topology and stream-dataflow architecture continue to avoid communication bottlenecks and synchronization overhead when scaled beyond two FPGAs or applied to networks with different spike sparsity patterns.

What would settle it

Running the cortical microcircuit on three or more FPGAs or on a network with denser spikes and observing a real-time factor above 1.0 or clear deviations in activity statistics would show the design fails to scale as claimed.

Figures

Figures reproduced from arXiv: 2604.28059 by Artur Podobas, Muhammad Ihsan Al Hafiz.

**Figure 1.** Architecture of our proposed NeuroRing core, showing (left) the overarching core architecture, and (right) details of the NPU and the accumulator structure. Excerpt from the adjacent text: … the NPU generates a global synchronization token that propagates through the right ring to signal the timestep’s completion across all cores. During this period, the spike recorder writes spike traces back to HBM. The HBM interface uses one AXI4 maste…

**Figure 2.** Bidirectional ring topology and experiment setup. Excerpt from the adjacent text: … replicated and connected to neighboring cores via their left and right AXI-stream ports. Within a single device, these connections form a closed bidirectional ring. Across devices, the ring is extended through high-speed serial links. Figure 2a shows the system-level organization. Each device contains multiple NeuroRing cores and two Aurora kernels located …

**Figure 3.** Layer-wise raster plots of the cortical microcircuit. [Plot residue omitted; the recoverable labels are panels L23E, L23I, L4E, and L4I and the legend entries NEST and NeuroRing.]

**Figure 4.** Layer-wise statistical comparison for the full-size cortical microcircuit. Excerpt from the adjacent text: 6.3 Architectural Design Space Exploration. This experiment explores how neuron capacity per core affects real-time performance, power, and energy efficiency under a full-size cortical microcircuit workload on a fixed two-FPGA deployment.

**Figure 5.** Impact of core capacity variation on RTF, Power, and Energy. Excerpt from the adjacent text: … is lower than that of the 4096- and 8192-neuron/core cases. This indicates that the performance trend is influenced not only by the number of cores but also by the achievable clock frequency after implementation. Figure 5b shows that lower core capacity increases total power consumption, from about 70 W for the 8192- and 5632-neuron/core cases to …

**Figure 6.** Strong-scaling comparison for the Half cortical microcircuit between NeuroRing and the Dardel CPU system. [Plot residue from the weak-scaling RTF panel, recoverable labels only: Quarter (5 cores, 1 FPGA, 304.1 MHz); Half (10 cores, 1 FPGA, 203.5 MHz); Full (20 cores, 2 FPGAs, 272.4 MHz); annotation "Hardware Shift to 2 FPGAs"; RTF values 0.52, 0.73, 0.83 …]

**Figure 7.** Performance weak scaling of the NeuroRing architecture (4096 Neurons/core). Excerpt from the adjacent text: … FPGA communication, while for Dardel, it is associated with inter-node MPI communication. Figures 6b and 6c show that the reduction in runtime is accompanied by higher power and energy cost on both platforms. For NeuroRing, total power increases from 32.71 W to 79.17 W, and energy per synaptic event rises from 49 nJ to 84 nJ when …

**Figure 8.** Three Sudoku benchmarks. Gray cells: given puzzle. Blue cells: solver solution. Excerpt from the adjacent text: 6.6 Beyond Neuroscience: Constraint Satisfaction Problem. To demonstrate that NeuroRing is not limited to neuroscience workloads, we apply it to Sudoku, formulated here as an SNN winner-takes-all network as described in Section 5. Three Sudoku benchmark puzzles are executed on a single FPGA using one NeuroRing core with 4096 ne…

Original abstract

Spiking neural networks (SNNs) are a promising paradigm for energy-efficient event-driven computation, but large-scale SNN execution remains challenging because sparse spike communication and synchronization can dominate runtime. Existing solutions across CPU, GPU, ASIC, and FPGA platforms offer different trade-offs between programmability, efficiency, and scalability. To address this gap, we present NeuroRing, a modular and scalable SNN accelerator based on a stream-dataflow architecture and a bidirectional ring topology, implemented in High-Level Synthesis (HLS) on FPGAs. NeuroRing supports modular single- and multi-FPGA deployment and is compatible with existing SNN workflows through integration with the NEST simulator. We evaluate NeuroRing on the cortical microcircuit benchmark and a Sudoku constraint-satisfaction workload. Results show that NeuroRing preserves the key activity statistics of the NEST reference model, achieves faster-than-real-time execution of the full-scale cortical microcircuit with a real-time factor (RTF) of 0.83, exhibits meaningful strong and weak scaling, and provides competitive energy efficiency on two programmable FPGAs. These results position NeuroRing as a flexible and scalable platform for both neuroscience simulation and broader event-driven applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NeuroRing gets a working two-FPGA NEST-compatible SNN accelerator to RTF 0.83 with preserved statistics, but the multi-FPGA scaling claims rest on thin evidence from just that setup.

Colleague, NeuroRing is a concrete FPGA implementation that hooks into the NEST simulator and runs the full cortical microcircuit faster than real time on two boards while keeping the key activity statistics the same. The energy numbers are competitive for the hardware they used. That combination of simulator compatibility and measured performance on a standard benchmark is the part that makes the work usable right away for people who already run models in NEST. The Sudoku workload adds a second data point showing the design is not locked to one neuroscience case.

The actual new element is the specific pairing of a bidirectional ring topology with a stream-dataflow fabric, all generated through HLS, for modular single- or multi-FPGA SNN deployment. They ship something that can be dropped into existing workflows without forcing a full rewrite of the model side.

The soft spot is the scaling argument. The abstract states that the design exhibits meaningful strong and weak scaling and supports modular multi-FPGA use, yet every reported number comes from a configuration of at most two FPGAs. A ring has a hop count that grows with the number of boards, and spike traffic in the cortical microcircuit is sparse but temporally clustered, so any per-hop delay in the dataflow fabric can accumulate. The paper gives no hop-latency measurements, no communication-volume bounds, and no results past two boards. That leaves the extrapolation to larger deployments as the weakest link.

This paper is for hardware researchers who prototype event-driven accelerators on FPGAs and want to keep a direct path back to established simulators. A reader building neuromorphic systems or scaling SNNs on reconfigurable hardware would find the implementation choices and the NEST integration worth examining.

I would send it to peer review. The core implementation is real, the metrics are checkable, and the NEST hook is a practical advantage even if the scaling section needs more data or analysis to stand on its own for bigger systems.

Referee Report

1 major / 2 minor

Summary. The manuscript presents NeuroRing, a modular SNN accelerator based on a bidirectional ring topology and stream-dataflow architecture implemented in HLS on programmable FPGAs. It integrates with the NEST simulator and is evaluated on the cortical microcircuit benchmark and a Sudoku workload, claiming preservation of key activity statistics from the NEST reference, faster-than-real-time execution with RTF of 0.83, meaningful strong and weak scaling, and competitive energy efficiency on a two-FPGA deployment.

Significance. If the multi-FPGA scaling claims hold, NeuroRing would offer a flexible, programmable platform that combines the workflow compatibility of software simulators like NEST with hardware acceleration for event-driven SNNs, potentially advancing large-scale neuroscience modeling and broader event-driven applications. The HLS implementation and modular design are strengths that support reproducibility and adaptability.

major comments (1)

[§5 (Evaluation)] All reported quantitative results (RTF of 0.83, energy efficiency, and claims of 'meaningful strong and weak scaling' plus 'modular single- and multi-FPGA deployment') derive exclusively from a two-FPGA configuration. No analytic bound on communication volume, measured per-FPGA hop latency, or weak-scaling experiments that vary FPGA count while holding neurons per FPGA constant are provided. This is load-bearing for the central claim, as a bidirectional ring has O(n) worst-case hop count and cortical microcircuit spike traffic is sparse but temporally correlated, risking accumulated delays that could alter RTF or spike timing for n>2.
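To make the O(n) concern concrete, the sketch below computes the worst-case and mean shortest-path hop counts on a bidirectional ring of n nodes. It assumes shortest-path routing and destinations spread uniformly over the other nodes. It does not model traffic locality or any measured NeuroRing latency, and `ring_hop_stats` is a hypothetical helper.

```python
from typing import Tuple


def ring_hop_stats(n: int) -> Tuple[int, float]:
    """Worst-case and mean shortest-path hops from one node to all others."""
    if n < 2:
        return 0, 0.0
    # Distance to the node k steps away is the shorter of the two directions.
    dists = [min(k, n - k) for k in range(1, n)]
    return max(dists), sum(dists) / len(dists)


for n in (2, 4, 8, 16):
    worst, mean = ring_hop_stats(n)
    print(f"n={n:2d}  worst={worst}  mean={mean:.2f}")
```

Under these assumptions the worst case is floor(n/2) hops and the mean grows roughly as n/4, so both scale linearly with ring size unless traffic is predominantly local.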

minor comments (2)

[Abstract] The claim that NeuroRing 'preserves the key activity statistics' should explicitly name the compared measures (e.g., firing rates, burst statistics, or pairwise correlations) rather than leaving them implicit.
[§3 (Architecture)] The stream-dataflow fabric description would benefit from a diagram or pseudocode clarifying how spike packets are serialized and routed to avoid ambiguity in the bidirectional ring implementation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading of the manuscript and the focused comment on the evaluation of scaling behavior. We address the points raised below and indicate the revisions we will make to strengthen the presentation of results.

Referee: [§5 (Evaluation)] All reported quantitative results (RTF of 0.83, energy efficiency, and claims of 'meaningful strong and weak scaling' plus 'modular single- and multi-FPGA deployment') derive exclusively from a two-FPGA configuration. No analytic bound on communication volume, measured per-FPGA hop latency, or weak-scaling experiments that vary FPGA count while holding neurons per FPGA constant are provided. This is load-bearing for the central claim, as a bidirectional ring has O(n) worst-case hop count and cortical microcircuit spike traffic is sparse but temporally correlated, risking accumulated delays that could alter RTF or spike timing for n>2.

Authors: We agree that the primary quantitative results, including the RTF of 0.83 and energy-efficiency measurements, are reported for the two-FPGA configuration; this was the largest setup available in our laboratory. The manuscript does demonstrate modularity by directly comparing single-FPGA and two-FPGA executions of the identical cortical microcircuit, which constitutes a strong-scaling result for a fixed total neuron count. For weak scaling, the Sudoku workload experiments vary problem size on the available hardware, but we did not perform additional runs that increase FPGA count while holding neurons per FPGA constant beyond the two-device case. We will revise §5 to explicitly state the scope of the evaluated configurations and to qualify the scaling claims accordingly. We will also add an analytic bound on communication volume under the bidirectional ring and stream-dataflow model, together with estimated per-FPGA hop latencies derived from the topology and the pipelined spike transfer mechanism. On the risk of accumulated delays for n>2, the architecture pipelines transfers in both directions of the ring; because cortical connectivity is sparse and predominantly local, the average hop distance remains small even as ring size grows. We will include a short discussion of these factors and their implications for spike timing in the revised manuscript.

Revision: partial
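The distinction between strong and weak scaling drawn in this exchange can be sketched numerically. Strong scaling fixes the problem and adds devices, while weak scaling grows the problem with the hardware. The runtimes passed to `strong_scaling` below are hypothetical. The weak-scaling call uses the RTF values 0.52 and 0.83 that appear in the weak-scaling plot text on this page, assuming they correspond to the Quarter (5 cores) and Full (20 cores) configurations.

```python
def strong_scaling(t_base_s, t_scaled_s, base_devices, scaled_devices):
    """Speedup and parallel efficiency for a fixed problem size."""
    if min(t_base_s, t_scaled_s) <= 0 or min(base_devices, scaled_devices) <= 0:
        raise ValueError("times and device counts must be positive")
    speedup = t_base_s / t_scaled_s
    efficiency = speedup / (scaled_devices / base_devices)
    return speedup, efficiency


def weak_scaling_efficiency(rtf_base, rtf_scaled):
    """Efficiency when problem size grows with hardware (1.0 is ideal).

    RTF is proportional to wall-clock time for a fixed simulated duration,
    so the ratio of RTFs equals the ratio of runtimes.
    """
    if rtf_base <= 0 or rtf_scaled <= 0:
        raise ValueError("RTF values must be positive")
    return rtf_base / rtf_scaled


# Hypothetical strong-scaling run: 10.0 s on 1 device, 6.25 s on 2 devices.
speedup, eff = strong_scaling(10.0, 6.25, 1, 2)
print(f"strong scaling: speedup={speedup:.2f}, efficiency={eff:.2f}")

# Weak scaling from the RTF values quoted on this page (mapping assumed).
weak_eff = weak_scaling_efficiency(0.52, 0.83)
print(f"weak scaling efficiency: {weak_eff:.2f}")
```

Note that the three weak-scaling configurations on this page run at different clock frequencies, so a ratio like this mixes architectural and implementation effects.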

Circularity Check

0 steps flagged

No circularity: empirical hardware evaluation with direct measurements

The paper describes a concrete FPGA implementation of a stream-dataflow SNN accelerator using a bidirectional ring topology, evaluated empirically on the cortical microcircuit and Sudoku workloads. All quantitative claims (RTF of 0.83, preservation of NEST activity statistics, strong/weak scaling, energy efficiency) are presented as measured outcomes from the two-FPGA prototype rather than as predictions derived from equations, fitted parameters, or self-citations. No derivation chain, ansatz, uniqueness theorem, or renaming of known results appears; the central results are self-contained experimental data and do not reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the new hardware design faithfully reproduces NEST dynamics and that ring communication scales without hidden costs; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)

domain assumption: The NEST cortical microcircuit model produces biologically plausible activity statistics that serve as ground truth.
The paper treats preservation of NEST statistics as validation of correctness.

invented entities (1)

NeuroRing architecture (no independent evidence)
purpose: Modular multi-FPGA SNN accelerator using a bidirectional ring and stream-dataflow architecture
New named system introduced to solve scalability and communication issues in SNN execution.

pith-pipeline@v0.9.0 · 5526 in / 1319 out tokens · 55711 ms · 2026-05-07T06:13:08.794098+00:00 · methodology