pith. machine review for the scientific record.

arxiv: 2605.12049 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI · cs.IT · cs.NE · math.IT

Recognition: no theorem link

Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.IT · cs.NE · math.IT
keywords recurrent networks · scaling laws · expressive neurons · parameter tradeoffs · information theory · sequence modeling · ELM networks · neuromorphic benchmarks

The pith

Allocating a fixed parameter budget among neuron count, per-neuron complexity, and connectivity in recurrent networks produces a non-trivial optimum that shifts toward more complex neurons as the budget grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how to divide a fixed parameter budget P among the number of recurrent units N, each unit's effective complexity k_e, and its connectivity k_c. It introduces ELM neurons that allow these three quantities to be tuned independently while training stably over wide ranges of scale. Experiments on two sequence tasks show that performance rises with each dimension separately, yet under a constant total budget a clear tradeoff optimum appears; larger budgets move the optimum toward higher per-neuron complexity. A closed-form information-theoretic model accounts for the observed diminishing returns by attributing them to per-neuron signal-to-noise saturation and population-level redundancy.
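
To make the budget arithmetic concrete, the sketch below enumerates allocations of a fixed budget P under the simplified accounting P ≈ N·(k_e + k_c). The accounting rule, the grid of values, and the function names are illustrative assumptions, not the paper's actual parameter count or code.

```python
# Illustrative sketch only: enumerate ways to spend a fixed parameter budget
# across neuron count N, per-neuron complexity k_e, and per-neuron connectivity k_c.
# The accounting rule P ~ N * (k_e + k_c) is an assumption for illustration,
# not the paper's exact parameter accounting.

def allocations(budget, ke_options=(8, 32, 128, 512), kc_options=(8, 32, 128)):
    """Yield (N, k_e, k_c) triples that approximately exhaust the budget."""
    for ke in ke_options:
        for kc in kc_options:
            n = budget // (ke + kc)  # neurons affordable at this per-unit cost
            if n >= 1:
                yield n, ke, kc

if __name__ == "__main__":
    budget = 1_000_000
    for n, ke, kc in allocations(budget):
        print(f"N={n:>7,}  k_e={ke:>4}  k_c={kc:>4}  params≈{n * (ke + kc):,}")
```

Each row is one point on the allocation simplex the paper sweeps; the empirical question is which row performs best at a given budget, and how that row moves as the budget grows.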

Core claim

Under a fixed parameter budget, recurrent networks built from ELM neurons exhibit a non-trivial optimum in the allocation of units, per-unit complexity, and connectivity; larger budgets shift this optimum toward greater per-neuron complexity. The tradeoffs are captured by a closed-form information-theoretic model that attributes diminishing returns to per-neuron signal-to-noise saturation at high complexity and across-neuron redundancy at high connectivity or low complexity.
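
The paper's closed-form model itself is not reproduced here. As a hedged illustration of the shape such an account can take, the LaTeX fragment below writes representation information as a per-neuron channel term that saturates in k_e, discounted by an effective population size that saturates in N. The specific functional forms (power-law noise with a floor, a pairwise-redundancy discount) are assumptions chosen to echo the saturation and redundancy mechanisms named above, not the authors' equations.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Illustrative shape only -- the functional forms below are assumptions,
% not the paper's actual closed-form model.
\begin{align}
  I_{\mathrm{rep}}(N, k_e)
    &\approx \frac{N_{\mathrm{eff}}(N)}{2}\,
       \log_2\!\bigl(1 + \mathrm{SNR}(k_e)\bigr), &
  \mathrm{SNR}(k_e) &= \frac{\sigma_f^2}{\sigma_n^2(k_e)}, \\
  \sigma_n^2(k_e) &\propto k_e^{-\alpha} + \sigma_{\min}^2
    && \text{(per-neuron error: power-law decay, then a noise floor)}, \\
  N_{\mathrm{eff}}(N) &= \frac{N}{1 + c\,(N - 1)}
    && \text{(correlated, redundant neurons add diminishing information)}.
\end{align}
Maximizing $I_{\mathrm{rep}}$ subject to a budget constraint of the form
$P \approx N\,(k_e + k_c)$ then yields an interior optimum in the allocation,
of the kind the experiments report.
\end{document}
```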

What carries the argument

The ELM neuron, an Expressive Leaky Memory unit whose design permits independent tuning of effective complexity k_e and connectivity k_c while maintaining stable training across scales.

If this is right

  • Performance increases monotonically when varying N, k_e, or k_c individually.
  • The optimal balance shifts toward higher k_e as total parameters grow.
  • The information-theoretic model predicts the locations of the performance peaks.
  • Sweeps over three orders of magnitude in parameters trace a consistent scaling surface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same allocation principles may apply to other sequence architectures beyond recurrent networks.
  • Biological cortical neurons may have evolved their complexity to optimize similar efficiency tradeoffs under resource constraints.
  • New benchmarks could test whether the identified optimum generalizes beyond the SHD-Adding and Enwik8 tasks.

Load-bearing premise

The ELM neuron design permits truly independent control of complexity and connectivity without unintended interactions, and the two sequence benchmarks suffice to establish a general scaling law.
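
One lightweight way to probe this premise (a sketch, not the authors' analysis): sweep k_e and k_c on a grid at matched budgets and test whether their effects on performance combine roughly additively or show a strong interaction. Everything below, including the synthetic data, variable names, and the log-linear form, is an assumption for illustration.

```python
# Hedged sketch: probe "independent control" by fitting a simple log-linear model
# with an interaction term to (k_e, k_c, performance) sweep results.
import numpy as np

def interaction_strength(ke_vals, kc_vals, performance):
    """Fit perf ~ b0 + b1*log(ke) + b2*log(kc) + b3*log(ke)*log(kc); return b3."""
    lke, lkc = np.log(ke_vals), np.log(kc_vals)
    X = np.column_stack([np.ones_like(lke), lke, lkc, lke * lkc])
    coef, *_ = np.linalg.lstsq(X, performance, rcond=None)
    return coef[3]  # near zero => the two axes act roughly additively

# Example with synthetic, additive data (so the interaction should come out near 0):
rng = np.random.default_rng(0)
ke = rng.choice([8, 32, 128, 512], size=200).astype(float)
kc = rng.choice([8, 32, 128], size=200).astype(float)
perf = 0.5 * np.log(ke) + 0.3 * np.log(kc) + rng.normal(0, 0.05, size=200)
print(f"estimated interaction coefficient: {interaction_strength(ke, kc, perf):+.3f}")
```

A near-zero interaction estimate would support the premise; a large one would mean the sweep axes are entangled and the tradeoff surface is not what it appears to be.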

What would settle it

A direct test would be to measure whether performance on additional sequence modeling tasks continues to favor increasingly complex neurons as the total parameter count rises beyond the range explored, or whether the predicted information-theoretic curves match observed error rates when k_e or k_c are varied independently.
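
A minimal sketch of that test, under the same assumed functional forms as the model sketch above: compute the theory's predicted optimal k_e at several budgets and compare it against the empirically best k_e from a sweep. The forms, the constants, and the folding of connectivity into k_e are all illustrative assumptions, not the paper's fitted model.

```python
# Hedged sketch of the "direct test": does the predicted optimum k_e* track the
# empirically best k_e as the budget grows? All forms and constants are placeholders.
import numpy as np

def predicted_irep(n, ke, alpha=0.5, noise_floor=0.01, corr=1e-4):
    snr = 1.0 / (ke ** -alpha + noise_floor)   # per-neuron SNR saturates in k_e
    n_eff = n / (1.0 + corr * (n - 1))         # redundancy limits effective width
    return 0.5 * n_eff * np.log2(1.0 + snr)

def theory_optimum(budget, ke_grid=np.logspace(1, 4, 50)):
    # Spend the whole budget: N = budget / k_e (connectivity folded into k_e here).
    n_grid = budget / ke_grid
    scores = predicted_irep(n_grid, ke_grid)
    return ke_grid[np.argmax(scores)]

for budget in (1e5, 1e6, 1e7):
    print(f"budget={budget:.0e}  predicted k_e* ≈ {theory_optimum(budget):.0f}")
```

If the predicted k_e* and the empirical peak diverge systematically as the budget grows, the closed-form account fails exactly where it claims the most.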

Figures

Figures reproduced from arXiv: 2605.12049 by Aaron Spieler, Anna Levina, Georg Martius.

Figure 1
Figure 1: Complexity of cortical neurons motivates the search for the optimal complexity of a unit in parameter-constrained recurrent networks. a) Cortical neurons combine complex dendritic structure with rich internal dynamics, making them powerful spatio-temporal processing units. b) Using ELM neurons [1], from simple integrators with two memory units and a few parameters k_simple to large models exceeding the comp… view at source ↗
Figure 2
Figure 2: A stable and flexible model system for studying scaling and tradeoffs in recurrent networks of expressive neurons. a) The modified Expressive Leaky Memory (ELM) neuron. b) ELM neurons are assembled as an ELM Network, a doubly recurrent sequence model whose computational core is a wide recurrent hidden layer in which each neuron itself is recurrent; followed by a smaller readout layer and output projection. c)… view at source ↗
Figure 3
Figure 3: More neurons or more complex neurons each improve performance on the SHD-Adding task; under a fixed budget, a clear non-trivial optimum emerges. Reference model (triangle) was chosen to be below saturation along all dimensions. Test accuracy improves with a) the number of neurons N_rec and b) neuron complexity k_e ∼ d_m², with d_mlp = 2·d_m. c) Under fixed parameter budget, a clear optimum emerges in the number… view at source ↗
Figure 4
Figure 4: Monotonic scaling gains extend across vast network sizes; larger budgets favor both more and more complex neurons, and connectivity introduces new tradeoff dimensions. Enwik8: character-level language modeling (test BPC, lower is better) [54]. Reference model marked with a triangle. Exemplary network activity in Appendix. view at source ↗
Figure 5
Figure 5: The Effective Representation Information derived from viewing neural layers as noisy channels is motivated by residual-error scaling in single neurons. a) Neural activations may be viewed as a noisy multi-channel representation of their inputs, with I_rep quantifying how much task-relevant information they transmit. b) Per-neuron complexity determines the residual variance σ_n²(k_e), a decreasing function … view at source ↗
Figure 6
Figure 6: The theoretical model qualitatively reproduces the empirical scaling tradeoffs and links their shifts to measurable architectural quantities. a–f) Enwik8 experiments (top, test BPC) are paired with corresponding theory ablations (bottom, −I_rep) obtained via joint fit with shared reference parameters (Pearson r = 0.98). Theory optima marked with a golden star. In all panels, reference model ds/50 = 15, τ_max… view at source ↗
Figure 7
Figure 7: A simple joint-scaling heuristic consistent with theory traces the empirical Pareto frontier across both datasets and three orders of magnitude. We perform a structured large-scale hyper-parameter search including d_mlp, d_tree and d_branch. Mean across three runs reported on SHD-Adding and single-run performance on Enwik8. a,b) On both datasets networks with more and more complex neurons become optimal as bu… view at source ↗
Figure 8
Figure 8: Training drives ELM Networks into a sparse, mostly asynchronous, irregular activity regime. Example inference on Enwik8 of a reference model configuration with N = 1024 and d_m = 15. a, b) Individual neurons' activity is characterized by brief spike-like above-threshold activations, and high-frequency sub-threshold fluctuations. At the population level, ∼10% of neurons are active at any given time, firing a… view at source ↗
Figure 9
Figure 9: Proportional connectivity and deeper dendritic integration scale better even accounting for additional parameter cost: ablations of the number of neurons and neuron complexity on Enwik8 matching Figure 4a,b, with the x-axis plotted in terms of total trainable network parameters. While the curves rescale, the same trend emerges; a) scaling with the number of neurons works better with proportional neuron connectivity, and … view at source ↗
Figure 10
Figure 10: ELM Network training is smooth and gradients remain stable throughout: Training dynamics of the reference run with N = 1024 and d_m = 15 in Figure 6. a) Train and valid BPC over 750 training turns, converging near 1.644 valid BPC, slightly below the test BPC of 1.65. b) Min and max parameter gradient norms, logged every 50th gradient step for 5000 samples, remain stable throughout training with no signs of explodi… view at source ↗
Figure 11
Figure 11: A wide range of high-pass filter timescales stabilize training and performance: Ablation of τ_r on Enwik8 for the reference architecture with N = 1024 and d_m = 15. Training remains stable and yields similar performance over an order of magnitude in τ_r. Training runs with too large τ_r become unstable, ones with too small timescale remove all signal from neuron output. view at source ↗
Figure 12
Figure 12: The qualitative trade-off between number of neurons and complexity also persists for feed-forward networks: We evaluate the N vs k_e tradeoff for purely feed-forward ELM-Network with ρ_rec = 0.0 on Enwik8. Note that individual ELM Neurons remain internally recurrent. We likewise observe the emergence of non-trivial optima where network configurations with intermediate ELM Neuron complexity perform best, that… view at source ↗
Figure 13
Figure 13: The ELM-Network benefits from L2 neuron MLP output and L1 network activity regularization, without changing the qualitative tradeoffs: Regularizer setup described in Appendix A.2. a) Compared to the default regularizer setup (blue curve), disabling the L1 regularizer on network activity (orange curve) slightly degrades performance for larger networks. Additionally disabling the L2 regularizer on the neur… view at source ↗
Figure 14
Figure 14: The ELM-NET architecture exhibits rich temporal dynamics, characterized by periods of asynchronous irregular firing, synchronized bursts, and network silence: An example network inference of an ELM Network with 128 neurons on SHD-Adding (see Appendix A.2). The network's hidden layer displays short spike or burst like activity, with more active network phases visibly correlating to high input activity peri… view at source ↗
Figure 15
Figure 15: Connectivity introduces the same qualitative tradeoffs on SHD-Adding as on Enwik8: Estimated dataset noise floor at 88% marked with dashed line. a) Increasing synapses per branch d_branch improves performance with diminishing returns. b) An optimal recurrent fraction ρ_rec ≈ 0.30 exists, which is roughly proportional to the ratio of recurrent to recurrent plus input connections. c) The tradeoff between neur… view at source ↗
Figure 16
Figure 16: Joint theory-experiment fit across seven experiments. Each point is one (N_rec, d_m) configuration; the seven experiments span three parameter budgets and two ablation pairs targeting α (τ_m,max, l_mlp) and β (d_branch). Eight theory parameters, the four shared reference values plus per-experiment variants for each ablated quantity, and a single affine map are fit jointly to all 103 points by minimizing the su… view at source ↗
Figure 17
Figure 17: A single neuron's reduction of approximation error is well described as a power law in parameter count. Various sized ELM Neurons fit to NeuronIO with reducible MAE reported as mean over three seeds. Decay curves fitted in log-log. a) The reduction in error is compatible with a power law decay across two orders of magnitude in per-neuron parameters k_e, strongly preferred over exponential-decay alternatives… view at source ↗
Figure 18
Figure 18: The ELM neuron voltage prediction residuals on NeuronIO have some state and temporal dependence: Residuals for a neuron model with d_m = 3 fitted on the NeuronIO dataset containing single neuron voltage recordings. Note that the underlying target membrane voltage data itself displays multiple operating regimes, with particularly violent dynamics towards spiking threshold (≈ −60 mV). a) Voltage-prediction r… view at source ↗
Figure 19
Figure 19: Power-law structure in the ELM neurons' output. Function fitting performed in log-log. Individual neuron output measured at the memory readout w^⊤ m_t, as it still contains the task-relevant slow signal components. Eigenvalues computed across 50 distinct recordings of 512 steps, after discarding 128-step burn-in. Recordings from the reference model with N_rec = 1024 on Enwik8. a) The neuron's output covariance e… view at source ↗
Figure 20
Figure 20: Empirical neuron complexity scaling on Enwik8 independently validates the max noise floor assumption: Reducible test BPC vs. per-neuron effective parameter count k_e, with layer width kept fixed at N = 1024 but increasing layer budgets. Functions fitted in log-log, with floor function slopes seeded with pure power law fit, and floors seeded with last evaluation point. Model comparison uses the corrected Ak… view at source ↗
Figure 21
Figure 21: Gaussianity of the readout w_r^⊤ m_t of neurons on Enwik8. Only y = f + n is observable, so Gaussianity is tested on y directly. Data plotted from 50 distinct recordings of 512 steps, after discarding 128-step burn-in. Recordings from the reference model with N_rec = 1024 on Enwik8. a) Pooled marginal of per-neuron z-scored activity vs. N(0, 1), shape-only. b) Per-neuron (skew, excess kurtosis) with Gaussia… view at source ↗
read the original abstract

Cortical neurons are complex, multi-timescale processors wired into recurrent circuits, shaped by long evolutionary pressure under stringent biological constraints. Mainstream machine learning, by contrast, predominantly builds models from extremely simple units, a default inherited from early neural-network theory. We treat this as a normative architectural question. How should one split a fixed parameter budget $P$ between the number of units $N$, per-unit effective complexity $k_e$, and per-unit connectivity $k_c$? What controls the optimal allocation? This calls for a model in which per-unit complexity can be tuned independently of width and connectivity. Accordingly, we introduce the ELM Network, whose recurrent layer is built from Expressive Leaky Memory (ELM) neurons, chosen to mirror functional components of cortical neurons. The architecture allows for individually adjusting $N$, $k_e$, and $k_c$ and trains stably across orders of magnitude in scale. We evaluate the model on two qualitatively different sequence benchmarks: the neuromorphic SHD-Adding task and Enwik8 character-level language modeling. Performance improves monotonically along each of the three axes individually. Under a fixed budget, a clear non-trivial optimum emerges in their tradeoff, and larger budgets favor both more and more complex neurons. A closed-form information-theoretic model captures these tradeoffs and attributes the diminishing returns at two ends to: per-neuron signal-to-noise saturation and across-neuron redundancy. A hyperparameter sweep spanning three orders of magnitude in trainable parameters traces a near-Pareto-frontier scaling law consistent with the framework. This suggests that the simple-unit default in ML is not obviously optimal once this tradeoff surface is probed, and offers a normative lens on cortex's reliance on complex spatio-temporal integrators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Expressive Leaky Memory (ELM) neurons for recurrent networks, allowing independent tuning of neuron count N, per-neuron effective complexity k_e, and per-neuron connectivity k_c within a fixed parameter budget P. On SHD-Adding and Enwik8 benchmarks, performance improves monotonically along each axis separately; under fixed P a non-monotonic optimum appears, with larger budgets favoring both higher N and higher k_e. A closed-form information-theoretic model is derived to explain the observed tradeoffs, attributing diminishing returns to per-neuron signal-to-noise saturation at high k_e and across-neuron redundancy at high N.

Significance. If the claimed independence of k_e and k_c holds and the closed-form model is predictive rather than post-hoc, the work supplies a concrete scaling-law framework that challenges the default use of simple units in recurrent architectures and offers a normative account of why biological circuits employ complex spatio-temporal integrators. The three-axis sweep spanning three orders of magnitude in P and the explicit attribution of the two saturation regimes are the strongest contributions.

major comments (3)
  1. [ELM neuron definition and hyperparameter sweep] The central claim that N, k_e, and k_c can be varied independently rests on the ELM neuron definition. The leaky-integrator dynamics and recurrent connectivity matrix share the same hidden state; any normalization, initialization, or gradient flow through the leak parameter could induce statistical dependence between effective complexity and effective connectivity. Without an explicit orthogonality test (e.g., ablation of leak time-constant while holding connectivity matrix statistics fixed, or measurement of mutual information between the two axes), the three-axis sweep does not necessarily trace an orthogonal tradeoff surface, rendering the closed-form model and the conclusion that “larger budgets favor both more and more complex neurons” potentially post-hoc.
  2. [Closed-form information-theoretic model] The information-theoretic model is presented as capturing the observed tradeoffs. If its free parameters (signal-to-noise saturation threshold, redundancy coefficient) are fitted to the same hyperparameter sweeps they are meant to explain, the derivation reduces to a descriptive fit rather than a predictive, parameter-free account. The manuscript should report whether the model parameters were fixed a priori from information-theoretic considerations or optimized on the same data.
  3. [Experimental evaluation] The two benchmarks (SHD-Adding and Enwik8) are qualitatively different, yet the scaling-law claim is stated generally. No cross-benchmark consistency check or additional task (e.g., a long-range dependency or continuous-control task) is reported to establish that the non-monotonic optimum and the two saturation regimes are not benchmark-specific.
minor comments (2)
  1. Notation for k_e and k_c should be defined once at first use and used consistently; the abstract and main text occasionally switch between “effective complexity” and “per-unit complexity” without explicit mapping.
  2. Figure captions for the scaling-law plots should include the exact ranges of N, k_e, and k_c explored and the total parameter count P for each point.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each of the major comments below and have updated the manuscript accordingly to strengthen the claims regarding parameter independence and model derivation.

read point-by-point responses
  1. Referee: [ELM neuron definition and hyperparameter sweep] The central claim that N, k_e, and k_c can be varied independently rests on the ELM neuron definition. The leaky-integrator dynamics and recurrent connectivity matrix share the same hidden state; any normalization, initialization, or gradient flow through the leak parameter could induce statistical dependence between effective complexity and effective connectivity. Without an explicit orthogonality test (e.g., ablation of leak time-constant while holding connectivity matrix statistics fixed, or measurement of mutual information between the two axes), the three-axis sweep does not necessarily trace an orthogonal tradeoff surface, rendering the closed-form model and the conclusion that “larger budgets favor both more and more complex neurons” potentially post-hoc.

    Authors: The ELM neuron parameterization explicitly separates the leak time constants, which determine the per-neuron effective complexity k_e through multi-timescale integration, from the recurrent connectivity matrix that sets k_c. By construction, these are controlled by distinct sets of parameters, and the hyperparameter sweeps were designed to vary them independently while keeping the total parameter count P fixed. To directly address the potential for induced dependence, we have added an orthogonality analysis in the revised manuscript, including an ablation where leak parameters are varied while holding connectivity statistics fixed, confirming that the effective axes remain largely orthogonal. This supports the validity of the three-axis sweep and the scaling conclusions. revision: yes

  2. Referee: [Closed-form information-theoretic model] The information-theoretic model is presented as capturing the observed tradeoffs. If its free parameters (signal-to-noise saturation threshold, redundancy coefficient) are fitted to the same hyperparameter sweeps they are meant to explain, the derivation reduces to a descriptive fit rather than a predictive, parameter-free account. The manuscript should report whether the model parameters were fixed a priori from information-theoretic considerations or optimized on the same data.

    Authors: The parameters in the closed-form model, including the signal-to-noise saturation threshold and redundancy coefficient, were determined a priori based on information-theoretic bounds on neuron capacity and redundancy in recurrent networks, derived from standard results on mutual information in noisy channels and population coding. They were not optimized on the experimental data. We have revised the manuscript to explicitly state the a priori derivation and the specific values used, along with a sensitivity analysis showing robustness. revision: yes

  3. Referee: [Experimental evaluation] The two benchmarks (SHD-Adding and Enwik8) are qualitatively different, yet the scaling-law claim is stated generally. No cross-benchmark consistency check or additional task (e.g., a long-range dependency or continuous-control task) is reported to establish that the non-monotonic optimum and the two saturation regimes are not benchmark-specific.

    Authors: The SHD-Adding and Enwik8 tasks were selected precisely because they differ substantially in input modality, temporal structure, and task demands—one being a neuromorphic spike-based addition task and the other a character-level language modeling benchmark. The observed scaling laws, including the non-monotonic optimum under fixed P and the two saturation regimes, are consistent across both, as detailed in the results section. While we acknowledge that further validation on additional tasks such as long-range dependency benchmarks would be valuable, the qualitative differences between the current pair provide support for the generality of the framework. We have expanded the discussion section to include a cross-benchmark consistency analysis. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces the ELM neuron design to enable independent variation of N, k_e and k_c, reports monotonic improvements along each axis, identifies a non-monotonic optimum under fixed budget, and presents a closed-form information-theoretic model that attributes diminishing returns to signal-to-noise saturation and redundancy. No equation or description in the abstract or provided text shows that the closed-form model is obtained by fitting parameters to the same hyperparameter sweeps it explains, nor does any step reduce a claimed prediction to its inputs by construction. The scaling-law trace is described as consistent with the framework rather than derived from it tautologically. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that ELM neurons provide independent tunable complexity and that the information-theoretic model accurately attributes the observed diminishing returns without circular fitting.

free parameters (2)
  • per-unit effective complexity k_e
    Tuned independently as one of the three allocation axes in the fixed-budget experiments.
  • per-unit connectivity k_c
    Tuned independently as one of the three allocation axes in the fixed-budget experiments.
axioms (1)
  • domain assumption: ELM neurons mirror functional components of cortical neurons
    Explicitly stated as the design choice for the recurrent layer.
invented entities (1)
  • Expressive Leaky Memory (ELM) neuron (no independent evidence)
    purpose: To allow independent adjustment of per-unit complexity k_e separately from width N and connectivity k_c
    Newly introduced architecture component whose properties are not derived from prior literature in the abstract.

pith-pipeline@v0.9.0 · 5627 in / 1408 out tokens · 39389 ms · 2026-05-13T06:23:59.185813+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 4 internal anchors

  1. [1]

    The expressive leaky memory neuron: an efficient and expressive phenomenological neuron model can solve long- horizon tasks

    Aaron Spieler, Nasim Rahaman, Georg Martius, Bernhard Schölkopf, and Anna Levina. The expressive leaky memory neuron: an efficient and expressive phenomenological neuron model can solve long- horizon tasks. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=vE1e1mLJ0U

  2. [2]

    Are dendrites conceptually useful?Neuroscience, 2022

    Matthew Larkum. Are dendrites conceptually useful?Neuroscience, 2022

  3. [3]

    What makes human cortical pyramidal neurons functionally complex.bioRxiv, pages 2024–12, 2024

    Ido Aizenbud, Daniela Yoeli, David Beniaguev, Christiaan PJ de Kock, Michael London, and Idan Segev. What makes human cortical pyramidal neurons functionally complex.bioRxiv, pages 2024–12, 2024

  4. [4]

    Illuminating dendritic function with computational models

    Panayiota Poirazi and Athanasia Papoutsi. Illuminating dendritic function with computational models. Nature Reviews Neuroscience, 21(6):303–321, 2020

  5. [5]

    McCulloch–Pitts Neural Network Model

    Snehashish Chakraverty, Deepti Moyi Sahoo, and Nisha Rani Mahato.McCulloch–Pitts Neural Network Model, pages 167–173. Springer Singapore, Singapore, 2019. ISBN 978-981-13-7430-2. doi: 10.1007/ 978-981-13-7430-2_11. URLhttps://doi.org/10.1007/978-981-13-7430-2_11

  6. [6]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  7. [7]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

  8. [8]

    xlstm: Extended long short-term memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory. 37, 2024

  9. [9]

    Which model to use for cortical spiking neurons?IEEE transactions on neural networks, 15(5):1063–1070, 2004

    Eugene M Izhikevich. Which model to use for cortical spiking neurons?IEEE transactions on neural networks, 15(5):1063–1070, 2004

  10. [10]

    Adaptive exponential integrate-and-fire model as an effective description of neuronal activity.Journal of neurophysiology, 94(5):3637–3642, 2005

    Romain Brette and Wulfram Gerstner. Adaptive exponential integrate-and-fire model as an effective description of neuronal activity.Journal of neurophysiology, 94(5):3637–3642, 2005

  11. [11]

    Neuronal dynamics: From single neurons to networks and models of cognition

    Wulfram Gerstner, Werner M Kistler, Richard Naud, and Liam Paninski.Neuronal dynamics: From single neurons to networks and models of cognition. Cambridge University Press, 2014

  12. [12]

    Long short-term memory and learning-to-learn in networks of spiking neurons.Advances in neural information processing systems, 31, 2018

    Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert Legenstein, and Wolfgang Maass. Long short-term memory and learning-to-learn in networks of spiking neurons.Advances in neural information processing systems, 31, 2018

  13. [13]

    Simple models including energy and spike constraints reproduce complex activity patterns and metabolic disruptions.PLoS Computational Biology, 16(12):e1008503, 2020

    Tanguy Fardet and Anna Levina. Simple models including energy and spike constraints reproduce complex activity patterns and metabolic disruptions.PLoS Computational Biology, 16(12):e1008503, 2020

  14. [14]

    Pyramidal neuron as two-layer neural network

    Panayiota Poirazi, Terrence Brannon, and Bartlett W Mel. Pyramidal neuron as two-layer neural network. Neuron, 37(6):989–999, 2003

  15. [15]

    An augmented two-layer model captures nonlinear analog spatial integration effects in pyramidal neuron dendrites

    Monika P Jadi, Bardia F Behabadi, Alon Poleg-Polsky, Jackie Schiller, and Bartlett W Mel. An augmented two-layer model captures nonlinear analog spatial integration effects in pyramidal neuron dendrites. Proceedings of the IEEE, 102(5):782–798, 2014

  16. [16]

    Dendritic action potentials and computation in human layer 2/3 cortical neurons.Science, 367(6473):83–87, 2020

    Albert Gidon, Timothy Adam Zolnik, Pawel Fidzinski, Felix Bolduan, Athanasia Papoutsi, Panayiota Poirazi, Martin Holtkamp, Imre Vida, and Matthew Evan Larkum. Dendritic action potentials and computation in human layer 2/3 cortical neurons.Science, 367(6473):83–87, 2020

  17. [17]

    Global and multiplexed dendritic computations under in vivo-like conditions.Neuron, 100(3):579–592, 2018

    Balázs B Ujfalussy, Judit K Makara, Máté Lengyel, and Tiago Branco. Global and multiplexed dendritic computations under in vivo-like conditions.Neuron, 100(3):579–592, 2018

  18. [18]

    Dendritic integration: 60 years of progress.Nature neuroscience, 18 (12):1713–1721, 2015

    Greg J Stuart and Nelson Spruston. Dendritic integration: 60 years of progress.Nature neuroscience, 18 (12):1713–1721, 2015

  19. [19]

    Single cortical neurons as deep artificial neural networks.Neuron, 109(17):2727–2739, 2021

    David Beniaguev, Idan Segev, and Michael London. Single cortical neurons as deep artificial neural networks.Neuron, 109(17):2727–2739, 2021

  20. [20]

    Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

  21. [21]

    Long short-term memory.Neural computation, 9(8):1735–1780, 1997

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.Neural computation, 9(8):1735–1780, 1997

  22. [22]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. InInternational Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=uYLFoz1vlAC

  23. [23]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020. doi: 10.48550/arXiv.2001.08361

  24. [24]

    Training compute-optimal large language models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  25. [25]

    The heidelberg spiking data sets for the systematic evaluation of spiking neural networks.IEEE Transactions on Neural Networks and Learning Systems, 2020

    Benjamin Cramer, Yannik Stradmann, Johannes Schemmel, and Friedemann Zenke. The heidelberg spiking data sets for the systematic evaluation of spiking neural networks.IEEE Transactions on Neural Networks and Learning Systems, 2020

  26. [26]

    Networks of spiking neurons: the third generation of neural network models.Neural networks, 10(9):1659–1671, 1997

    Wolfgang Maass. Networks of spiking neurons: the third generation of neural network models.Neural networks, 10(9):1659–1671, 1997

  27. [27]

    Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach

    Herbert Jaeger. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach, 2002

  28. [28]

    State-dependent computations: spatiotemporal processing in cortical networks

    Dean V . Buonomano and Wolfgang Maass. State-dependent computations: spatiotemporal processing in cortical networks.Nature Reviews Neuroscience, 10(2):113–125, 2009. doi: 10.1038/nrn2558

  29. [29]

    Context-dependent computation by recurrent dynamics in prefrontal cortex

    Valerio Mante, David Sussillo, Krishna V. Shenoy, and William T. Newsome. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78–84, 2013. doi: 10.1038/nature12742

  30. [30]

    Theoretical neuroscience: Computational and mathematical modeling of neural systems

    Peter Dayan and Laurence F Abbott.Theoretical neuroscience: computational and mathematical modeling of neural systems. MIT press, 2005

  31. [31]

    A clockwork RNN

    Jan Koutník, Klaus Greff, Faustino Gomez, and Jürgen Schmidhuber. A clockwork RNN. InProceedings of the 31st International Conference on Machine Learning, volume 32 ofProceedings of Machine Learning Research, pages 1863–1871. PMLR, 2014. URL https://proceedings.mlr.press/v32/koutnik14. html

  32. [32]

    Relational recurrent neural networks

    Adam Santoro, Ryan Faulkner, David Raposo, Jack W. Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy P. Lillicrap. Relational recurrent neural networks. InAdvances in Neural Information Processing Systems, volume 31, pages 7310–7321, 2018. URLhttps://papers.nips.cc/paper/7960-relational-recurrent-neural-networks

  33. [33]

    Recurrent independent mechanisms

    Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. Recurrent independent mechanisms. InInternational Conference on Learning Representations, 2021. URLhttps://openreview.net/forum?id=mLcmdlEUxy-

  34. [34]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V . Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum? id=B1ckMDqlg

  35. [35]

    Simplified state space layers for sequence modeling

    Jimmy T.H. Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. InThe Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=Ai8Hw3AXqks

  36. [36]

    Continuous thought machines

    Luke Nicholas Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, and Llion Jones. Continuous thought machines. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=y0wDflmpLk

  37. [37]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017. doi: 10.48550/arXiv.1712.00409

  38. [38]

    Scaling laws for autoregressive generative modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010....

  39. [39]

    Universal approximation bounds for superpositions of a sigmoidal function

    Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993. doi: 10.1109/18.256500

  40. [40]

    Approximation by superpositions of a sigmoidal function

    George Cybenko. Approximation by superpositions of a sigmoidal function.Mathematics of Control, Signals and Systems, 2(4):303–314, 1989. doi: 10.1007/BF02551274

  41. [41]

    On the number of linear regions of deep neural networks

    Guido F. Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. InAdvances in Neural Information Processing Systems, volume 27, 2014

  42. [42]

    Benefits of depth in neural networks

    Matus Telgarsky. Benefits of depth in neural networks. InProceedings of the 29th Conference on Learning Theory, volume 49 ofProceedings of Machine Learning Research, pages 1517–1539. PMLR, 2016. URL https://proceedings.mlr.press/v49/telgarsky16.html

  43. [43]

    On the expressive power of deep neural networks

    Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. InProceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 2847–2854. PMLR, 2017. URL https: //proceedings.mlr.press/v70/raghu17a.html

  44. [44]

    A mathematical theory of communication

    Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948. doi: 10.1002/j.1538-7305.1948.tb01338.x

  45. [45]

    Possible principles underlying the transformations of sensory messages

    Horace B. Barlow. Possible principles underlying the transformations of sensory messages. In Walter A. Rosenblith, editor, Sensory Communication, pages 217–234. MIT Press, Cambridge, MA, 1961

  46. [46]

    Some informational aspects of visual perception

    Fred Attneave. Some informational aspects of visual perception. Psychological Review, 61(3):183–193, 1954

  47. [47]

    doi: 10.1037/h0054663

  48. [48]

    A simple coding procedure enhances a neuron's information capacity

    Simon B. Laughlin. A simple coding procedure enhances a neuron’s information capacity.Zeitschrift für Naturforschung C, 36(9–10):910–912, 1981. doi: 10.1515/znc-1981-9-1040

  49. [49]

    Emergence of simple-cell receptive field properties by learning a sparse code for natural images

    Bruno A. Olshausen and David J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images.Nature, 381(6583):607–609, 1996. doi: 10.1038/381607a0

  50. [50]

    The information bottleneck method

    Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pages 368–377, 1999

  51. [51]

    The effect of correlated variability on the accuracy of a population code

    Larry F. Abbott and Peter Dayan. The effect of correlated variability on the accuracy of a population code. Neural Computation, 11(1):91–101, 1999. doi: 10.1162/089976699300016827

  52. [52]

    Neural correlations, population coding and computation

    Bruno B. Averbeck, Peter E. Latham, and Alexandre Pouget. Neural correlations, population coding and computation.Nature Reviews Neuroscience, 7(5):358–366, 2006. doi: 10.1038/nrn1888

  53. [53]

    Information-limiting correlations.Nature Neuroscience, 17(10):1410–1417, 2014

    Rubén Moreno-Bote, Jeffrey Beck, Ingmar Kanitscheider, Xaq Pitkow, Peter Latham, and Alexandre Pouget. Information-limiting correlations.Nature Neuroscience, 17(10):1410–1417, 2014. doi: 10.1038/nn.3807

  54. [54]

    Superspike: Supervised learning in multilayer spiking neural networks.Neural computation, 30(6):1514–1541, 2018

    Friedemann Zenke and Surya Ganguli. Superspike: Supervised learning in multilayer spiking neural networks.Neural computation, 30(6):1514–1541, 2018

  55. [55]

    Large text compression benchmark, 2011

    Matt Mahoney. Large text compression benchmark, 2011. URL http://www.mattmahoney.net/dc/ text.html

  56. [56]

    Transformer-xl: Attentive language models beyond a fixed-length context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 2978–2988, 2019

  57. [57]

    Fluctuation-driven initialization for spiking neural network training.Neuromorphic Computing and Engineering, 2(4):044016, 2022

    Julian Rossbroich, Julia Gygax, and Friedemann Zenke. Fluctuation-driven initialization for spiking neural network training.Neuromorphic Computing and Engineering, 2(4):044016, 2022

  58. [58]

    A surrogate gradient spiking baseline for speech command recognition.Frontiers in Neuroscience, 16:865897, 2022

    Alexandre Bittar and Philip N Garner. A surrogate gradient spiking baseline for speech command recognition. Frontiers in Neuroscience, 16:865897, 2022

  59. [59]

    Compressive transformers for long-range sequence modelling

    Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH

  60. [60]

    Models of neocortical layer 5b pyramidal cells capturing a wide range of dendritic and perisomatic active properties

    Etay Hay, Sean Hill, Felix Schürmann, Henry Markram, and Idan Segev. Models of neocortical layer 5b pyramidal cells capturing a wide range of dendritic and perisomatic active properties. PLoS Computational Biology, 7(7):e1002107, 2011

  61. [61]

    Dataloaders at https://github.com/AaronSpieler/elmneuron

    …and NeuronIO [19], we use the dataloaders provided by Spieler et al. at https://github.com/AaronSpieler/elmneuron, released under the MIT License; the SHD-Adding dataloader ingests SHD data. Enwik8 [54] is available from Matt Mahoney at http://mattmahoney.net/dc/enwik8.zip; it consists of the first 10^8 bytes of the March 3, 2006 English Wikipedia du…