pith. machine review for the scientific record.

arxiv: 2605.12049 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI · cs.IT · cs.NE · math.IT

Recognition: no theorem link

Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.IT · cs.NE · math.IT
keywords recurrent networks · scaling laws · expressive neurons · parameter tradeoffs · information theory · sequence modeling · ELM networks · neuromorphic benchmarks

The pith

Allocating a fixed parameter budget among neuron count, per-neuron complexity, and connectivity in recurrent networks produces a non-trivial optimum that shifts toward more complex neurons as the budget grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how to divide a fixed parameter budget P among the number of recurrent units N, each unit's effective complexity k_e, and its connectivity k_c. It introduces ELM neurons that allow these three quantities to be tuned independently while training stably over wide ranges of scale. Experiments on two sequence tasks show that performance rises with each dimension separately, yet under a constant total budget a clear tradeoff optimum appears; larger budgets move the optimum toward higher per-neuron complexity. A closed-form information-theoretic model accounts for the observed diminishing returns by attributing them to per-neuron signal-to-noise saturation and population-level redundancy.
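
To make the budget arithmetic concrete, the sketch below enumerates allocations of a fixed budget P under the simplified accounting P ≈ N·(k_e + k_c). The accounting rule, the grid of values, and the function names are illustrative assumptions, not the paper's actual parameter count or code.

```python
# Illustrative sketch only: enumerate ways to spend a fixed parameter budget
# across neuron count N, per-neuron complexity k_e, and per-neuron connectivity k_c.
# The accounting rule P ~ N * (k_e + k_c) is an assumption for illustration,
# not the paper's exact parameter accounting.

def allocations(budget, ke_options=(8, 32, 128, 512), kc_options=(8, 32, 128)):
    """Yield (N, k_e, k_c) triples that approximately exhaust the budget."""
    for ke in ke_options:
        for kc in kc_options:
            n = budget // (ke + kc)  # neurons affordable at this per-unit cost
            if n >= 1:
                yield n, ke, kc

if __name__ == "__main__":
    budget = 1_000_000
    for n, ke, kc in allocations(budget):
        print(f"N={n:>7,}  k_e={ke:>4}  k_c={kc:>4}  params≈{n * (ke + kc):,}")
```

Each row is one point on the allocation simplex the paper sweeps; the empirical question is which row performs best at a given budget, and how that row moves as the budget grows.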

Core claim

Under a fixed parameter budget, recurrent networks built from ELM neurons exhibit a non-trivial optimum in the allocation of units, per-unit complexity, and connectivity; larger budgets shift this optimum toward greater per-neuron complexity. The tradeoffs are captured by a closed-form information-theoretic model that attributes diminishing returns to per-neuron signal-to-noise saturation at high complexity and across-neuron redundancy at high connectivity or low complexity.
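
The paper's closed-form model itself is not reproduced here. As a hedged illustration of the shape such an account can take, the LaTeX fragment below writes representation information as a per-neuron channel term that saturates in k_e, discounted by an effective population size that saturates in N. The specific functional forms (power-law noise with a floor, a pairwise-redundancy discount) are assumptions chosen to echo the saturation and redundancy mechanisms named above, not the authors' equations.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Illustrative shape only -- the functional forms below are assumptions,
% not the paper's actual closed-form model.
\begin{align}
  I_{\mathrm{rep}}(N, k_e)
    &\approx \frac{N_{\mathrm{eff}}(N)}{2}\,
       \log_2\!\bigl(1 + \mathrm{SNR}(k_e)\bigr), &
  \mathrm{SNR}(k_e) &= \frac{\sigma_f^2}{\sigma_n^2(k_e)}, \\
  \sigma_n^2(k_e) &\propto k_e^{-\alpha} + \sigma_{\min}^2
    && \text{(per-neuron error: power-law decay, then a noise floor)}, \\
  N_{\mathrm{eff}}(N) &= \frac{N}{1 + c\,(N - 1)}
    && \text{(correlated, redundant neurons add diminishing information)}.
\end{align}
Maximizing $I_{\mathrm{rep}}$ subject to a budget constraint of the form
$P \approx N\,(k_e + k_c)$ then yields an interior optimum in the allocation,
of the kind the experiments report.
\end{document}
```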

What carries the argument

The ELM neuron, an Expressive Leaky Memory unit whose design permits independent tuning of effective complexity k_e and connectivity k_c while maintaining stable training across scales.

If this is right

  • Performance increases monotonically when varying N, k_e, or k_c individually.
  • The optimal balance shifts toward higher k_e as total parameters grow.
  • The information-theoretic model predicts the locations of the performance peaks.
  • Sweeps over three orders of magnitude in parameters trace a consistent scaling surface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same allocation principles may apply to other sequence architectures beyond recurrent networks.
  • Biological cortical neurons may have evolved their complexity to optimize similar efficiency tradeoffs under resource constraints.
  • New benchmarks could test whether the identified optimum generalizes beyond the SHD-Adding and Enwik8 tasks.

Load-bearing premise

The ELM neuron design permits truly independent control of complexity and connectivity without unintended interactions, and the two sequence benchmarks suffice to establish a general scaling law.
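
One lightweight way to probe this premise (a sketch, not the authors' analysis): sweep k_e and k_c on a grid at matched budgets and test whether their effects on performance combine roughly additively or show a strong interaction. Everything below, including the synthetic data, variable names, and the log-linear form, is an assumption for illustration.

```python
# Hedged sketch: probe "independent control" by fitting a simple log-linear model
# with an interaction term to (k_e, k_c, performance) sweep results.
import numpy as np

def interaction_strength(ke_vals, kc_vals, performance):
    """Fit perf ~ b0 + b1*log(ke) + b2*log(kc) + b3*log(ke)*log(kc); return b3."""
    lke, lkc = np.log(ke_vals), np.log(kc_vals)
    X = np.column_stack([np.ones_like(lke), lke, lkc, lke * lkc])
    coef, *_ = np.linalg.lstsq(X, performance, rcond=None)
    return coef[3]  # near zero => the two axes act roughly additively

# Example with synthetic, additive data (so the interaction should come out near 0):
rng = np.random.default_rng(0)
ke = rng.choice([8, 32, 128, 512], size=200).astype(float)
kc = rng.choice([8, 32, 128], size=200).astype(float)
perf = 0.5 * np.log(ke) + 0.3 * np.log(kc) + rng.normal(0, 0.05, size=200)
print(f"estimated interaction coefficient: {interaction_strength(ke, kc, perf):+.3f}")
```

A near-zero interaction estimate would support the premise; a large one would mean the sweep axes are entangled and the tradeoff surface is not what it appears to be.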

What would settle it

A direct test would be to measure whether performance on additional sequence modeling tasks continues to favor increasingly complex neurons as the total parameter count rises beyond the range explored, or whether the predicted information-theoretic curves match observed error rates when k_e or k_c are varied independently.
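
A minimal sketch of that test, under the same assumed functional forms as the model sketch above: compute the theory's predicted optimal k_e at several budgets and compare it against the empirically best k_e from a sweep. The forms, the constants, and the folding of connectivity into k_e are all illustrative assumptions, not the paper's fitted model.

```python
# Hedged sketch of the "direct test": does the predicted optimum k_e* track the
# empirically best k_e as the budget grows? All forms and constants are placeholders.
import numpy as np

def predicted_irep(n, ke, alpha=0.5, noise_floor=0.01, corr=1e-4):
    snr = 1.0 / (ke ** -alpha + noise_floor)   # per-neuron SNR saturates in k_e
    n_eff = n / (1.0 + corr * (n - 1))         # redundancy limits effective width
    return 0.5 * n_eff * np.log2(1.0 + snr)

def theory_optimum(budget, ke_grid=np.logspace(1, 4, 50)):
    # Spend the whole budget: N = budget / k_e (connectivity folded into k_e here).
    n_grid = budget / ke_grid
    scores = predicted_irep(n_grid, ke_grid)
    return ke_grid[np.argmax(scores)]

for budget in (1e5, 1e6, 1e7):
    print(f"budget={budget:.0e}  predicted k_e* ≈ {theory_optimum(budget):.0f}")
```

If the predicted k_e* and the empirical peak diverge systematically as the budget grows, the closed-form account fails exactly where it claims the most.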

Figures

Figures reproduced from arXiv: 2605.12049 by Aaron Spieler, Anna Levina, Georg Martius.

Figure 1
Figure 1: Complexity of cortical neurons motivates the search for the optimal complexity of a unit in parameter-constrained recurrent networks. a) Cortical neurons combine complex dendritic structure with rich internal dynamics, making them powerful spatio-temporal processing units. b) Using ELM neurons [1], from simple integrators with two memory units and a few parameters k_simple to large models exceeding the comp… view at source ↗
Figure 2
Figure 2: A stable and flexible model system for studying scaling and tradeoffs in recurrent networks of expressive neurons. a) The modified Expressive Leaky Memory (ELM) neuron. b) ELM neurons are assembled as an ELM Network, a doubly recurrent sequence model whose computational core is a wide recurrent hidden layer in which each neuron itself is recurrent; followed by a smaller readout layer and output projection. c)… view at source ↗
Figure 3
Figure 3: More neurons or more complex neurons each improve performance on the SHD-Adding task; under a fixed budget, a clear non-trivial optimum emerges. Reference model (triangle) was chosen to be below saturation along all dimensions. Test accuracy improves with a) the number of neurons N_rec and b) neuron complexity k_e ∼ d_m², with d_mlp = 2·d_m. c) Under fixed parameter budget, a clear optimum emerges in the number… view at source ↗
Figure 4
Figure 4: Monotonic scaling gains extend across vast network sizes; larger budgets favor both more and more complex neurons, and connectivity introduces new tradeoff dimensions. Enwik8: character-level language modeling (test BPC, lower is better) [54]. Reference model marked with a triangle. Exemplary network activity in Appendix. view at source ↗
Figure 5
Figure 5: The Effective Representation Information derived from viewing neural layers as noisy channels is motivated by residual-error scaling in single neurons. a) Neural activations may be viewed as a noisy multi-channel representation of their inputs, with I_rep quantifying how much task-relevant information they transmit. b) Per-neuron complexity determines the residual variance σ_n²(k_e), a decreasing function … view at source ↗
Figure 6
Figure 6: The theoretical model qualitatively reproduces the empirical scaling tradeoffs and links their shifts to measurable architectural quantities. a–f) Enwik8 experiments (top, test BPC) are paired with corresponding theory ablations (bottom, −I_rep) obtained via joint fit with shared reference parameters (Pearson r = 0.98). Theory optima marked with a golden star. In all panels, reference model ds/50 = 15, τ_max… view at source ↗
Figure 7
Figure 7: A simple joint-scaling heuristic consistent with theory traces the empirical Pareto frontier across both datasets and three orders of magnitude. We perform a structured large-scale hyper-parameter search including d_mlp, d_tree and d_branch. Mean across three runs reported on SHD-Adding and single-run performance on Enwik8. a,b) On both datasets networks with more and more complex neurons become optimal as bu… view at source ↗
Figure 8
Figure 8: Training drives ELM Networks into a sparse, mostly asynchronous, irregular activity regime. Example inference on Enwik8 of a reference model configuration with N = 1024 and d_m = 15. a, b) Individual neurons' activity is characterized by brief spike-like above-threshold activations, and high-frequency sub-threshold fluctuations. At the population level, ∼10% of neurons are active at any given time, firing a… view at source ↗
Figure 9
Figure 9: Proportional connectivity and deeper dendritic integration scale better even accounting for additional parameter cost: ablations of the number of neurons and neuron complexity on Enwik8 matching Figure 4a,b, with the x-axis plotted in terms of total trainable network parameters. While the curves rescale, the same trend emerges; a) scaling with the number of neurons works better with proportional neuron connectivity, and … view at source ↗
Figure 10
Figure 10: ELM Network training is smooth and gradients remain stable throughout: Training dynamics of the reference run with N = 1024 and d_m = 15 in Figure 6. a) Train and valid BPC over 750 training turns, converging near 1.644 valid BPC, slightly below the test BPC of 1.65. b) Min and max parameter gradient norms, logged every 50th gradient step for 5000 samples, remain stable throughout training with no signs of explodi… view at source ↗
Figure 11
Figure 11: A wide range of high-pass filter timescales stabilize training and performance: Ablation of τ_r on Enwik8 for the reference architecture with N = 1024 and d_m = 15. Training remains stable and yields similar performance over an order of magnitude in τ_r. Training runs with too large τ_r become unstable, ones with too small timescale remove all signal from neuron output. view at source ↗
Figure 12
Figure 12: The qualitative trade-off between number of neurons and complexity also persists for feed-forward networks: We evaluate the N vs k_e tradeoff for purely feed-forward ELM-Network with ρ_rec = 0.0 on Enwik8. Note that individual ELM Neurons remain internally recurrent. We likewise observe the emergence of non-trivial optima where network configurations with intermediate ELM Neuron complexity perform best, that… view at source ↗
Figure 13
Figure 13: The ELM-Network benefits from L2 neuron MLP output and L1 network activity regularization, without changing the qualitative tradeoffs: Regularizer setup described in Appendix A.2. a) Compared to the default regularizer setup (blue curve), disabling the L1 regularizer on network activity (orange curve) slightly degrades performance for larger networks. Additionally disabling the L2 regularizer on the neur… view at source ↗
Figure 14
Figure 14: The ELM-NET architecture exhibits rich temporal dynamics, characterized by periods of asynchronous irregular firing, synchronized bursts, and network silence: An example network inference of an ELM Network with 128 neurons on SHD-Adding (see Appendix A.2). The network's hidden layer displays short spike or burst like activity, with more active network phases visibly correlating to high input activity peri… view at source ↗
Figure 15
Figure 15: Connectivity introduces the same qualitative tradeoffs on SHD-Adding as on Enwik8: Estimated dataset noise floor at 88% marked with dashed line. a) Increasing synapses per branch d_branch improves performance with diminishing returns. b) An optimal recurrent fraction ρ_rec ≈ 0.30 exists, which is roughly proportional to the ratio of recurrent to recurrent plus input connections. c) The tradeoff between neur… view at source ↗
Figure 16
Figure 16: Joint theory-experiment fit across seven experiments. Each point is one (N_rec, d_m) configuration; the seven experiments span three parameter budgets and two ablation pairs targeting α (τ_m,max, l_mlp) and β (d_branch). Eight theory parameters, the four shared reference values plus per-experiment variants for each ablated quantity, and a single affine map are fit jointly to all 103 points by minimizing the su… view at source ↗
Figure 17
Figure 17: A single neuron's reduction of approximation error is well described as a power law in parameter count. Various sized ELM Neurons fit to NeuronIO with reducible MAE reported as mean over three seeds. Decay curves fitted in log-log. a) The reduction in error is compatible with a power law decay across two orders of magnitude in per-neuron parameters k_e, strongly preferred over exponential-decay alternatives… view at source ↗
Figure 18
Figure 18: The ELM neuron voltage prediction residuals on NeuronIO have some state and temporal dependence: Residuals for a neuron model with d_m = 3 fitted on the NeuronIO dataset containing single neuron voltage recordings. Note that the underlying target membrane voltage data itself displays multiple operating regimes, with particularly violent dynamics towards spiking threshold (≈ −60 mV). a) Voltage-prediction r… view at source ↗
Figure 19
Figure 19: Power-law structure in the ELM neurons' output. Function fitting performed in log-log. Individual neuron output measured at the memory readout w^⊤ m_t, as it still contains the task-relevant slow signal components. Eigenvalues computed across 50 distinct recordings of 512 steps, after discarding 128-step burn-in. Recordings from the reference model with N_rec = 1024 on Enwik8. a) The neuron's output covariance e… view at source ↗
Figure 20
Figure 20: Empirical neuron complexity scaling on Enwik8 independently validates the max noise floor assumption: Reducible test BPC vs. per-neuron effective parameter count k_e, with layer width kept fixed at N = 1024 but increasing layer budgets. Functions fitted in log-log, with floor function slopes seeded with pure power law fit, and floors seeded with last evaluation point. Model comparison uses the corrected Ak… view at source ↗
Figure 21
Figure 21: Gaussianity of the readout w_r^⊤ m_t of neurons on Enwik8. Only y = f + n is observable, so Gaussianity is tested on y directly. Data plotted from 50 distinct recordings of 512 steps, after discarding 128-step burn-in. Recordings from the reference model with N_rec = 1024 on Enwik8. a) Pooled marginal of per-neuron z-scored activity vs. N(0, 1), shape-only. b) Per-neuron (skew, excess kurtosis) with Gaussia… view at source ↗
read the original abstract

Cortical neurons are complex, multi-timescale processors wired into recurrent circuits, shaped by long evolutionary pressure under stringent biological constraints. Mainstream machine learning, by contrast, predominantly builds models from extremely simple units, a default inherited from early neural-network theory. We treat this as a normative architectural question. How should one split a fixed parameter budget $P$ between the number of units $N$, per-unit effective complexity $k_e$, and per-unit connectivity $k_c$? What controls the optimal allocation? This calls for a model in which per-unit complexity can be tuned independently of width and connectivity. Accordingly, we introduce the ELM Network, whose recurrent layer is built from Expressive Leaky Memory (ELM) neurons, chosen to mirror functional components of cortical neurons. The architecture allows for individually adjusting $N$, $k_e$, and $k_c$ and trains stably across orders of magnitude in scale. We evaluate the model on two qualitatively different sequence benchmarks: the neuromorphic SHD-Adding task and Enwik8 character-level language modeling. Performance improves monotonically along each of the three axes individually. Under a fixed budget, a clear non-trivial optimum emerges in their tradeoff, and larger budgets favor both more and more complex neurons. A closed-form information-theoretic model captures these tradeoffs and attributes the diminishing returns at two ends to: per-neuron signal-to-noise saturation and across-neuron redundancy. A hyperparameter sweep spanning three orders of magnitude in trainable parameters traces a near-Pareto-frontier scaling law consistent with the framework. This suggests that the simple-unit default in ML is not obviously optimal once this tradeoff surface is probed, and offers a normative lens on cortex's reliance on complex spatio-temporal integrators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Expressive Leaky Memory (ELM) neurons for recurrent networks, allowing independent tuning of neuron count N, per-neuron effective complexity k_e, and per-neuron connectivity k_c within a fixed parameter budget P. On SHD-Adding and Enwik8 benchmarks, performance improves monotonically along each axis separately; under fixed P a non-monotonic optimum appears, with larger budgets favoring both higher N and higher k_e. A closed-form information-theoretic model is derived to explain the observed tradeoffs, attributing diminishing returns to per-neuron signal-to-noise saturation at high k_e and across-neuron redundancy at high N.

Significance. If the claimed independence of k_e and k_c holds and the closed-form model is predictive rather than post-hoc, the work supplies a concrete scaling-law framework that challenges the default use of simple units in recurrent architectures and offers a normative account of why biological circuits employ complex spatio-temporal integrators. The three-axis sweep spanning three orders of magnitude in P and the explicit attribution of the two saturation regimes are the strongest contributions.

major comments (3)
  1. [ELM neuron definition and hyperparameter sweep] The central claim that N, k_e, and k_c can be varied independently rests on the ELM neuron definition. The leaky-integrator dynamics and recurrent connectivity matrix share the same hidden state; any normalization, initialization, or gradient flow through the leak parameter could induce statistical dependence between effective complexity and effective connectivity. Without an explicit orthogonality test (e.g., ablation of leak time-constant while holding connectivity matrix statistics fixed, or measurement of mutual information between the two axes), the three-axis sweep does not necessarily trace an orthogonal tradeoff surface, rendering the closed-form model and the conclusion that “larger budgets favor both more and more complex neurons” potentially post-hoc.
  2. [Closed-form information-theoretic model] The information-theoretic model is presented as capturing the observed tradeoffs. If its free parameters (signal-to-noise saturation threshold, redundancy coefficient) are fitted to the same hyperparameter sweeps they are meant to explain, the derivation reduces to a descriptive fit rather than a predictive, parameter-free account. The manuscript should report whether the model parameters were fixed a priori from information-theoretic considerations or optimized on the same data.
  3. [Experimental evaluation] The two benchmarks (SHD-Adding and Enwik8) are qualitatively different, yet the scaling-law claim is stated generally. No cross-benchmark consistency check or additional task (e.g., a long-range dependency or continuous-control task) is reported to establish that the non-monotonic optimum and the two saturation regimes are not benchmark-specific.
minor comments (2)
  1. Notation for k_e and k_c should be defined once at first use and used consistently; the abstract and main text occasionally switch between “effective complexity” and “per-unit complexity” without explicit mapping.
  2. Figure captions for the scaling-law plots should include the exact ranges of N, k_e, and k_c explored and the total parameter count P for each point.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each of the major comments below and have updated the manuscript accordingly to strengthen the claims regarding parameter independence and model derivation.

read point-by-point responses
  1. Referee: [ELM neuron definition and hyperparameter sweep] The central claim that N, k_e, and k_c can be varied independently rests on the ELM neuron definition. The leaky-integrator dynamics and recurrent connectivity matrix share the same hidden state; any normalization, initialization, or gradient flow through the leak parameter could induce statistical dependence between effective complexity and effective connectivity. Without an explicit orthogonality test (e.g., ablation of leak time-constant while holding connectivity matrix statistics fixed, or measurement of mutual information between the two axes), the three-axis sweep does not necessarily trace an orthogonal tradeoff surface, rendering the closed-form model and the conclusion that “larger budgets favor both more and more complex neurons” potentially post-hoc.

    Authors: The ELM neuron parameterization explicitly separates the leak time constants, which determine the per-neuron effective complexity k_e through multi-timescale integration, from the recurrent connectivity matrix that sets k_c. By construction, these are controlled by distinct sets of parameters, and the hyperparameter sweeps were designed to vary them independently while keeping the total parameter count P fixed. To directly address the potential for induced dependence, we have added an orthogonality analysis in the revised manuscript, including an ablation where leak parameters are varied while holding connectivity statistics fixed, confirming that the effective axes remain largely orthogonal. This supports the validity of the three-axis sweep and the scaling conclusions. revision: yes

  2. Referee: [Closed-form information-theoretic model] The information-theoretic model is presented as capturing the observed tradeoffs. If its free parameters (signal-to-noise saturation threshold, redundancy coefficient) are fitted to the same hyperparameter sweeps they are meant to explain, the derivation reduces to a descriptive fit rather than a predictive, parameter-free account. The manuscript should report whether the model parameters were fixed a priori from information-theoretic considerations or optimized on the same data.

    Authors: The parameters in the closed-form model, including the signal-to-noise saturation threshold and redundancy coefficient, were determined a priori based on information-theoretic bounds on neuron capacity and redundancy in recurrent networks, derived from standard results on mutual information in noisy channels and population coding. They were not optimized on the experimental data. We have revised the manuscript to explicitly state the a priori derivation and the specific values used, along with a sensitivity analysis showing robustness. revision: yes

  3. Referee: [Experimental evaluation] The two benchmarks (SHD-Adding and Enwik8) are qualitatively different, yet the scaling-law claim is stated generally. No cross-benchmark consistency check or additional task (e.g., a long-range dependency or continuous-control task) is reported to establish that the non-monotonic optimum and the two saturation regimes are not benchmark-specific.

    Authors: The SHD-Adding and Enwik8 tasks were selected precisely because they differ substantially in input modality, temporal structure, and task demands—one being a neuromorphic spike-based addition task and the other a character-level language modeling benchmark. The observed scaling laws, including the non-monotonic optimum under fixed P and the two saturation regimes, are consistent across both, as detailed in the results section. While we acknowledge that further validation on additional tasks such as long-range dependency benchmarks would be valuable, the qualitative differences between the current pair provide support for the generality of the framework. We have expanded the discussion section to include a cross-benchmark consistency analysis. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces the ELM neuron design to enable independent variation of N, k_e and k_c, reports monotonic improvements along each axis, identifies a non-monotonic optimum under fixed budget, and presents a closed-form information-theoretic model that attributes diminishing returns to signal-to-noise saturation and redundancy. No equation or description in the abstract or provided text shows that the closed-form model is obtained by fitting parameters to the same hyperparameter sweeps it explains, nor does any step reduce a claimed prediction to its inputs by construction. The scaling-law trace is described as consistent with the framework rather than derived from it tautologically. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that ELM neurons provide independent tunable complexity and that the information-theoretic model accurately attributes the observed diminishing returns without circular fitting.

free parameters (2)
  • per-unit effective complexity k_e
    Tuned independently as one of the three allocation axes in the fixed-budget experiments.
  • per-unit connectivity k_c
    Tuned independently as one of the three allocation axes in the fixed-budget experiments.
axioms (1)
  • domain assumption: ELM neurons mirror functional components of cortical neurons
    Explicitly stated as the design choice for the recurrent layer.
invented entities (1)
  • Expressive Leaky Memory (ELM) neuron (no independent evidence)
    purpose: To allow independent adjustment of per-unit complexity k_e separately from width N and connectivity k_c
    Newly introduced architecture component whose properties are not derived from prior literature in the abstract.

pith-pipeline@v0.9.0 · 5627 in / 1408 out tokens · 39389 ms · 2026-05-13T06:23:59.185813+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 4 internal anchors

  1. [1]

    The expressive leaky memory neuron: an efficient and expressive phenomenological neuron model can solve long- horizon tasks

    Aaron Spieler, Nasim Rahaman, Georg Martius, Bernhard Schölkopf, and Anna Levina. The expressive leaky memory neuron: an efficient and expressive phenomenological neuron model can solve long- horizon tasks. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=vE1e1mLJ0U

  2. [2]

    Are dendrites conceptually useful?Neuroscience, 2022

    Matthew Larkum. Are dendrites conceptually useful?Neuroscience, 2022

  3. [3]

    What makes human cortical pyramidal neurons functionally complex.bioRxiv, pages 2024–12, 2024

    Ido Aizenbud, Daniela Yoeli, David Beniaguev, Christiaan PJ de Kock, Michael London, and Idan Segev. What makes human cortical pyramidal neurons functionally complex.bioRxiv, pages 2024–12, 2024

  4. [4]

    Illuminating dendritic function with computational models

    Panayiota Poirazi and Athanasia Papoutsi. Illuminating dendritic function with computational models. Nature Reviews Neuroscience, 21(6):303–321, 2020

  5. [5]

    McCulloch–Pitts Neural Network Model

    Snehashish Chakraverty, Deepti Moyi Sahoo, and Nisha Rani Mahato.McCulloch–Pitts Neural Network Model, pages 167–173. Springer Singapore, Singapore, 2019. ISBN 978-981-13-7430-2. doi: 10.1007/ 978-981-13-7430-2_11. URLhttps://doi.org/10.1007/978-981-13-7430-2_11

  6. [6]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  7. [7]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

  8. [8]

    xlstm: Extended long short-term memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory. 37, 2024

  9. [9]

    Which model to use for cortical spiking neurons?IEEE transactions on neural networks, 15(5):1063–1070, 2004

    Eugene M Izhikevich. Which model to use for cortical spiking neurons?IEEE transactions on neural networks, 15(5):1063–1070, 2004

  10. [10]

    Adaptive exponential integrate-and-fire model as an effective description of neuronal activity.Journal of neurophysiology, 94(5):3637–3642, 2005

    Romain Brette and Wulfram Gerstner. Adaptive exponential integrate-and-fire model as an effective description of neuronal activity.Journal of neurophysiology, 94(5):3637–3642, 2005

  11. [11]

    Neuronal dynamics: From single neurons to networks and models of cognition

    Wulfram Gerstner, Werner M Kistler, Richard Naud, and Liam Paninski.Neuronal dynamics: From single neurons to networks and models of cognition. Cambridge University Press, 2014

  12. [12]

    Long short-term memory and learning-to-learn in networks of spiking neurons.Advances in neural information processing systems, 31, 2018

    Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert Legenstein, and Wolfgang Maass. Long short-term memory and learning-to-learn in networks of spiking neurons.Advances in neural information processing systems, 31, 2018

  13. [13]

    Simple models including energy and spike constraints reproduce complex activity patterns and metabolic disruptions.PLoS Computational Biology, 16(12):e1008503, 2020

    Tanguy Fardet and Anna Levina. Simple models including energy and spike constraints reproduce complex activity patterns and metabolic disruptions.PLoS Computational Biology, 16(12):e1008503, 2020

  14. [14]

    Pyramidal neuron as two-layer neural network

    Panayiota Poirazi, Terrence Brannon, and Bartlett W Mel. Pyramidal neuron as two-layer neural network. Neuron, 37(6):989–999, 2003

  15. [15]

    An augmented two-layer model captures nonlinear analog spatial integration effects in pyramidal neuron dendrites

    Monika P Jadi, Bardia F Behabadi, Alon Poleg-Polsky, Jackie Schiller, and Bartlett W Mel. An augmented two-layer model captures nonlinear analog spatial integration effects in pyramidal neuron dendrites. Proceedings of the IEEE, 102(5):782–798, 2014

  16. [16]

    Dendritic action potentials and computation in human layer 2/3 cortical neurons.Science, 367(6473):83–87, 2020

    Albert Gidon, Timothy Adam Zolnik, Pawel Fidzinski, Felix Bolduan, Athanasia Papoutsi, Panayiota Poirazi, Martin Holtkamp, Imre Vida, and Matthew Evan Larkum. Dendritic action potentials and computation in human layer 2/3 cortical neurons.Science, 367(6473):83–87, 2020

  17. [17]

    Global and multiplexed dendritic computations under in vivo-like conditions.Neuron, 100(3):579–592, 2018

    Balázs B Ujfalussy, Judit K Makara, Máté Lengyel, and Tiago Branco. Global and multiplexed dendritic computations under in vivo-like conditions.Neuron, 100(3):579–592, 2018

  18. [18]

    Dendritic integration: 60 years of progress.Nature neuroscience, 18 (12):1713–1721, 2015

    Greg J Stuart and Nelson Spruston. Dendritic integration: 60 years of progress.Nature neuroscience, 18 (12):1713–1721, 2015

  19. [19]

    Single cortical neurons as deep artificial neural networks.Neuron, 109(17):2727–2739, 2021

    David Beniaguev, Idan Segev, and Michael London. Single cortical neurons as deep artificial neural networks.Neuron, 109(17):2727–2739, 2021

  20. [20]

    Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

  21. [21]

    Long short-term memory.Neural computation, 9(8):1735–1780, 1997

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.Neural computation, 9(8):1735–1780, 1997

  22. [22]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. InInternational Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=uYLFoz1vlAC

  23. [23]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020. doi: 10.48550/arXiv.2001.08361

  24. [24]

    Training compute-optimal large language models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  25. [25]

    The heidelberg spiking data sets for the systematic evaluation of spiking neural networks.IEEE Transactions on Neural Networks and Learning Systems, 2020

    Benjamin Cramer, Yannik Stradmann, Johannes Schemmel, and Friedemann Zenke. The heidelberg spiking data sets for the systematic evaluation of spiking neural networks.IEEE Transactions on Neural Networks and Learning Systems, 2020

  26. [26]

    Networks of spiking neurons: the third generation of neural network models.Neural networks, 10(9):1659–1671, 1997

    Wolfgang Maass. Networks of spiking neurons: the third generation of neural network models.Neural networks, 10(9):1659–1671, 1997

  27. [27]

    Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach

    Herbert Jaeger. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach, 2002

  28. [28]

    State-dependent computations: spatiotemporal processing in cortical networks

    Dean V . Buonomano and Wolfgang Maass. State-dependent computations: spatiotemporal processing in cortical networks.Nature Reviews Neuroscience, 10(2):113–125, 2009. doi: 10.1038/nrn2558

  29. [29]

    Context-dependent computation by recurrent dynamics in prefrontal cortex

    Valerio Mante, David Sussillo, Krishna V. Shenoy, and William T. Newsome. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78–84, 2013. doi: 10.1038/nature12742

  30. [30]

    Theoretical neuroscience: Computational and mathematical modeling of neural systems

    Peter Dayan and Laurence F Abbott.Theoretical neuroscience: computational and mathematical modeling of neural systems. MIT press, 2005

  31. [31]

    A clockwork RNN

    Jan Koutník, Klaus Greff, Faustino Gomez, and Jürgen Schmidhuber. A clockwork RNN. InProceedings of the 31st International Conference on Machine Learning, volume 32 ofProceedings of Machine Learning Research, pages 1863–1871. PMLR, 2014. URL https://proceedings.mlr.press/v32/koutnik14. html

  32. [32]

    Relational recurrent neural networks

    Adam Santoro, Ryan Faulkner, David Raposo, Jack W. Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy P. Lillicrap. Relational recurrent neural networks. InAdvances in Neural Information Processing Systems, volume 31, pages 7310–7321, 2018. URLhttps://papers.nips.cc/paper/7960-relational-recurrent-neural-networks

  33. [33]

    Recurrent independent mechanisms

    Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. Recurrent independent mechanisms. InInternational Conference on Learning Representations, 2021. URLhttps://openreview.net/forum?id=mLcmdlEUxy-

  34. [34]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V . Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum? id=B1ckMDqlg

  35. [35]

    Simplified state space layers for sequence modeling

    Jimmy T.H. Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. InThe Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=Ai8Hw3AXqks

  36. [36]

    Continuous thought machines

    Luke Nicholas Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, and Llion Jones. Continuous thought machines. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=y0wDflmpLk

  37. [37]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017. doi: 10.48550/arXiv.1712.00409

  38. [38]

    Scaling laws for autoregressive generative modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010....

  39. [39]

    Universal approximation bounds for superpositions of a sigmoidal function

    Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993. doi: 10.1109/18.256500

  40. [40]

    Approximation by superpositions of a sigmoidal function

    George Cybenko. Approximation by superpositions of a sigmoidal function.Mathematics of Control, Signals and Systems, 2(4):303–314, 1989. doi: 10.1007/BF02551274

  41. [41]

    On the number of linear regions of deep neural networks

    Guido F. Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. InAdvances in Neural Information Processing Systems, volume 27, 2014

  42. [42]

    Benefits of depth in neural networks

    Matus Telgarsky. Benefits of depth in neural networks. InProceedings of the 29th Conference on Learning Theory, volume 49 ofProceedings of Machine Learning Research, pages 1517–1539. PMLR, 2016. URL https://proceedings.mlr.press/v49/telgarsky16.html

  43. [43]

    On the expressive power of deep neural networks

    Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. InProceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 2847–2854. PMLR, 2017. URL https: //proceedings.mlr.press/v70/raghu17a.html

  44. [44]

    A mathematical theory of communication

    Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948. doi: 10.1002/j.1538-7305.1948.tb01338.x

  45. [45]

    Possible principles underlying the transformations of sensory messages

    Horace B. Barlow. Possible principles underlying the transformations of sensory messages. In Walter A. Rosenblith, editor, Sensory Communication, pages 217–234. MIT Press, Cambridge, MA, 1961

  46. [46]

    Some informational aspects of visual perception

    Fred Attneave. Some informational aspects of visual perception. Psychological Review, 61(3):183–193, 1954

  47. [47]

    doi: 10.1037/h0054663

  48. [48]

    A simple coding procedure enhances a neuron's information capacity

    Simon B. Laughlin. A simple coding procedure enhances a neuron’s information capacity.Zeitschrift für Naturforschung C, 36(9–10):910–912, 1981. doi: 10.1515/znc-1981-9-1040

  49. [49]

    Emergence of simple-cell receptive field properties by learning a sparse code for natural images

    Bruno A. Olshausen and David J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images.Nature, 381(6583):607–609, 1996. doi: 10.1038/381607a0

  50. [50]

    The information bottleneck method

    Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pages 368–377, 1999

  51. [51]

    The effect of correlated variability on the accuracy of a population code

    Larry F. Abbott and Peter Dayan. The effect of correlated variability on the accuracy of a population code. Neural Computation, 11(1):91–101, 1999. doi: 10.1162/089976699300016827

  52. [52]

    Neural correlations, population coding and computation

    Bruno B. Averbeck, Peter E. Latham, and Alexandre Pouget. Neural correlations, population coding and computation.Nature Reviews Neuroscience, 7(5):358–366, 2006. doi: 10.1038/nrn1888

  53. [53]

    Information-limiting correlations.Nature Neuroscience, 17(10):1410–1417, 2014

    Rubén Moreno-Bote, Jeffrey Beck, Ingmar Kanitscheider, Xaq Pitkow, Peter Latham, and Alexandre Pouget. Information-limiting correlations.Nature Neuroscience, 17(10):1410–1417, 2014. doi: 10.1038/nn.3807

  54. [54]

    Superspike: Supervised learning in multilayer spiking neural networks.Neural computation, 30(6):1514–1541, 2018

    Friedemann Zenke and Surya Ganguli. Superspike: Supervised learning in multilayer spiking neural networks.Neural computation, 30(6):1514–1541, 2018

  55. [55]

    Large text compression benchmark, 2011

    Matt Mahoney. Large text compression benchmark, 2011. URL http://www.mattmahoney.net/dc/ text.html

  56. [56]

    Transformer-xl: Attentive language models beyond a fixed-length context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 2978–2988, 2019

  57. [57]

    Fluctuation-driven initialization for spiking neural network training.Neuromorphic Computing and Engineering, 2(4):044016, 2022

    Julian Rossbroich, Julia Gygax, and Friedemann Zenke. Fluctuation-driven initialization for spiking neural network training.Neuromorphic Computing and Engineering, 2(4):044016, 2022

  58. [58]

    A surrogate gradient spiking baseline for speech command recognition.Frontiers in Neuroscience, 16:865897, 2022

    Alexandre Bittar and Philip N Garner. A surrogate gradient spiking baseline for speech command recognition. Frontiers in Neuroscience, 16:865897, 2022

  59. [59]

    Compressive transformers for long-range sequence modelling

    Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH

  60. [60]

    Models of neocortical layer 5b pyramidal cells capturing a wide range of dendritic and perisomatic active properties

    Etay Hay, Sean Hill, Felix Schürmann, Henry Markram, and Idan Segev. Models of neocortical layer 5b pyramidal cells capturing a wide range of dendritic and perisomatic active properties. PLoS Computational Biology, 7(7):e1002107, 2011

  61. [61]

    Dataloaders at https://github.com/AaronSpieler/elmneuron

    …and NeuronIO [19], we use the dataloaders provided by Spieler et al. at https://github.com/AaronSpieler/elmneuron, released under the MIT License; the SHD-Adding dataloader ingests SHD data. Enwik8 [54] is available from Matt Mahoney at http://mattmahoney.net/dc/enwik8.zip; it consists of the first 10^8 bytes of the March 3, 2006 English Wikipedia du…