Growing a Neural Network in Breadth, Depth, and Time

Eivinas Butkus; Kedar Garzón Gupta; Nikolaus Kriegeskorte

arXiv: 2605.25174 · v1 · pith:LPKXKKYZnew · submitted 2026-05-24 · q-bio.NC · cs.LG · cs.NE

Pith reviewed 2026-06-29 23:20 UTC · model grok-4.3

keywords: recurrent convolutional networks, resource constraints, breadth depth time, differentiable costs, network growth, object recognition, human reaction times, computational graphs

The pith

Recurrent convolutional networks learn to trade off breadth, depth, and time when these resources are penalized with differentiable costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce differentiable costs for network breadth, depth, and processing time in a recurrent convolutional neural network modeled as a finite subset of an infinite lattice. Optimizing these costs together with task performance via backpropagation produces networks that can substitute any one resource for the others to reach a given accuracy. With increasing task complexity, the networks expand in all three dimensions, and they use more recurrent steps for partially occluded inputs. The computation time required by the model also correlates with human reaction times during object recognition. This framework provides a normative model for how resource constraints influence the emergence of neural architectures.

Core claim

Jointly optimizing task error with costs on breadth, depth, and time causes diverse computational graphs to emerge, with all three resources trading off against each other, networks growing in every dimension as tasks become harder, increased recurrence under occlusion, and model time correlating with human reaction times.

What carries the argument

Differentiable cost terms for breadth, depth, and time, optimized jointly with task errors via backpropagation within a recurrent convolutional network conceived as a finite subset of an infinite lattice.
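
As a rough sketch of what such a joint objective could look like, assuming the resource costs enter as a weighted sum: the weighted-sum form is an assumption made here for illustration, not a formula quoted from the paper. The symbols λbreadth, λdepth, λtime, and Ltime follow the figure captions; the names for the task, breadth, and depth terms are chosen by analogy.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{task}}
  + \lambda_{\text{breadth}}\,\mathcal{L}_{\text{breadth}}
  + \lambda_{\text{depth}}\,\mathcal{L}_{\text{depth}}
  + \lambda_{\text{time}}\,\mathcal{L}_{\text{time}}
```

Under this reading, setting a λ to zero removes the pressure on that resource, while raising it pushes the optimizer to spend less of that resource, possibly at some expense in task error.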

If this is right

Networks grow in breadth, depth, and time as task complexity increases.
All three resources can be traded off to achieve a target accuracy.
More recurrent steps are taken spontaneously when inputs are occluded.
Time used by the model correlates with human reaction times in object recognition.
Diverse computational graphs emerge under varying pressures on the three resources.
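
The trade-off claim can be made concrete with a Pareto filter of the kind described in the Figure 5 caption: keep models at or above a target accuracy, then discard any model that another eligible model matches or beats on all three resources. The sketch below is illustrative only. The `Model` record, the function names, and every number in it are hypothetical and not taken from the paper.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    """Hypothetical summary of one trained network."""
    name: str
    accuracy: float  # fraction correct, 0..1
    breadth: float   # e.g. channels used
    depth: float     # e.g. layers used
    time: float      # e.g. mean recurrent steps used


def dominates(a: Model, b: Model) -> bool:
    """True if `a` uses no more of any resource than `b` and strictly less of one."""
    ra = (a.breadth, a.depth, a.time)
    rb = (b.breadth, b.depth, b.time)
    no_worse = all(x <= y for x, y in zip(ra, rb))
    strictly_better = any(x < y for x, y in zip(ra, rb))
    return no_worse and strictly_better


def pareto_set(models: List[Model], min_accuracy: float = 0.70) -> List[Model]:
    """Models meeting the accuracy target that no other eligible model dominates."""
    eligible = [m for m in models if m.accuracy >= min_accuracy]
    return [m for m in eligible if not any(dominates(o, m) for o in eligible)]


# Made-up example values, chosen only to show the three kinds of trade-off.
models = [
    Model("wide_shallow", 0.72, breadth=64, depth=2, time=3),
    Model("narrow_deep", 0.71, breadth=16, depth=8, time=3),
    Model("small_slow", 0.70, breadth=16, depth=2, time=10),
    Model("wasteful", 0.75, breadth=64, depth=8, time=10),
    Model("too_weak", 0.55, breadth=8, depth=1, time=1),
]
front = pareto_set(models)
print([m.name for m in front])
```

If the three resources are fungible in the sense claimed, the surviving models should spread across the breadth, depth, and time axes rather than cluster at a single resource profile.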

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar mechanisms might explain variations in brain size and recurrence across different species or cognitive demands.
The approach could be used to design artificial networks that balance efficiency across multiple resource dimensions.
Applying the method to other sensory modalities or tasks might test whether the human reaction time correlation generalizes.
The infinite lattice conception allows for continuous rather than discrete network scaling in theory.

Load-bearing premise

The specific differentiable costs chosen for breadth, depth, and time, along with the recurrent convolutional architecture as a finite subset of an infinite lattice, faithfully capture the resource constraints that shape neural computation.

What would settle it

A failure to observe resource trade-offs, lack of growth in all three dimensions with task complexity, or absence of correlation between model time and human reaction times on the object recognition task would falsify the central claims.

Figures

Figures reproduced from arXiv: 2605.25174 by Eivinas Butkus, Kedar Garzón Gupta, Nikolaus Kriegeskorte.

**Figure 1.** (a) The space of possible computational graphs can be conceptualized as an infinite lattice, extending in the space of resource use. Here we consider breadth, depth, and time. (b) Each model instance is a finite subset of the infinite lattice with its own profile of resource use. Our framework lets the network select its own position in this space by optimizing differentiable resource costs. We implement the l…

**Figure 2.** Model architecture. The network implements a finite subset of the infinite lattice (…

**Figure 3.** Breadth vs. depth. (a) Raw costs decrease smoothly with increasing λbreadth and λdepth. (b) Average weight magnitudes across layers and channels for each λ combination, with pruned model boundaries shown in red (preserving 98% above-chance accuracy). Top right of each panel shows accuracy before → after pruning. Shallow-and-wide models (top left) can achieve comparable accuracy to narrow-and-deep models (botto…

**Figure 4.** Time. (a) λtime vs. time cost Ltime. (b) λtime vs. time used. (c) Time used vs. accuracy: adaptive time selection dominates fixed. (d) Occlusion introduced at test time increases time used, demonstrating that the model adaptively chooses how long to compute. (e–h) Adaptive model behavior averaged across all λtime > 0. (e) Easy and hard images for several categories, defined by model time used. Model spends more time o…

**Figure 5.** Breadth vs. depth vs. time. (a) Accuracy as a function of λbreadth and λdepth for increasing λtime (left to right). (b) Pareto-optimal models (red) that achieve ≥70% accuracy while minimizing breadth, depth, and time used, shown in 3D resource space. (c) Pairwise 2D projections of the Pareto set. Red points spread across all projections, indicating that breadth, depth, and time are fungible. (d) Error consistency …

**Figure 6.** Task complexity. (a) Weight magnitude maps across layers and channels for MNIST, CIFAR-10, and Tiny ImageNet under matched resource pressures (λtime = 0.1, single model instance shown per panel). Networks grow in breadth and depth as the task becomes more complex. (b) Resources used (channels, layers, time) as a function of resource pressure for each dataset. CIFAR-10 and Tiny ImageNet use more spatial resourc…

**Figure 7.** Attribution map entropy as a function of accuracy and number of layers used. At matched …

Original abstract

Spatial and temporal resource constraints are critical for both biological and artificial intelligent systems. Here we define differentiable cost terms for breadth, depth, and time within a recurrent convolutional neural network conceived as a finite subset of an infinite lattice. We optimize these costs jointly with task errors via backpropagation. We set different pressures on breadth, depth, and time, which leads to diverse computational graphs emerging organically through training. We find that all three resources can be traded off against each other to achieve a given level of accuracy. Networks grow in all three dimensions with task complexity and spontaneously take more recurrent steps when inputs are occluded. Surprisingly, time used by the model correlates with human reaction times in an object recognition task. Our framework provides a normative account of how resource constraints shape neural architectures, connecting to questions about brain design in neuroscience, and may help illuminate the diversity of neural solutions found in nature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a joint differentiable optimization over breadth, depth, and time in one lattice RNN that produces trade-offs and a human RT correlation, but the cost functions are chosen rather than derived and the full results are still needed to judge robustness.

The main thing here is that they put breadth, depth, and time costs into one recurrent conv net on a lattice and optimize all three together with the task loss. Different pressure settings produce different graphs, networks expand in all three dimensions as tasks get harder, they take extra steps on occluded inputs, and the number of steps correlates with human reaction times on object recognition.

That unified setup is the real step forward. Most prior work handles depth or recurrence separately; doing all three at once inside the same differentiable framework is cleaner and lets the trade-offs appear without hand-designed schedules.

The soft spot is the cost terms. They are picked to be differentiable and to yield the behaviors the authors want, not taken from measured biological quantities like actual wiring length or ATP use. Change the exponents or the normalization and the reported growth patterns and the human correlation could shift or vanish. The abstract gives no equations, no training details, and no stats on the correlation, so it is hard to tell how much the lambdas were adjusted to produce the results or how stable the findings are under different random seeds.

This is worth a serious referee for computational neuroscience and efficient-ML groups. Readers who care about normative accounts of architecture will want to see whether the joint optimization survives when the costs are replaced with more biologically grounded penalties. The idea is coherent enough that it should go out for review rather than get desk-rejected.

Referee Report

2 major / 1 minor

Summary. The paper defines differentiable cost terms for breadth, depth, and time within a recurrent convolutional network treated as a finite subset of an infinite lattice. These costs are optimized jointly with task loss via backpropagation under varying resource pressures, producing emergent computational graphs. Key findings include trade-offs among the three resources for fixed accuracy, growth in all dimensions with task complexity, increased recurrent steps under occlusion, and a correlation between model time usage and human reaction times in object recognition.

Significance. If the results hold under scrutiny of the cost functions and controls, the work supplies a normative model linking resource constraints to architectural diversity, with direct relevance to questions of brain design in neuroscience. The joint optimization and spontaneous emergence of dynamics (e.g., extra recurrent steps) are notable strengths when the cost terms can be independently justified.

major comments (2)

[Abstract] The assertion that the framework supplies a 'normative account of how resource constraints shape neural architectures' is load-bearing for the central claim, yet the differentiable cost terms for breadth, depth, and time are selected for differentiability rather than derived from measured biological quantities (e.g., wiring length or metabolic rate); alternative exponents or normalizations could eliminate the reported trade-offs and spontaneous dynamics.
[Abstract] The resource pressure coefficients (lambdas) are free parameters; without an independent justification or sensitivity analysis showing that the growth patterns and human-RT correlation survive changes in functional form, the normative interpretation risks circularity.

minor comments (1)

[Abstract] The reported correlation between model time and human reaction times is stated without any statistical test, effect size, number of participants, or controls for task difficulty.
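
One way to report the statistics asked for here, sketched under the assumption that per-image model time (recurrent steps) and mean human reaction time are available as paired lists: a Spearman rank correlation as the effect size, plus a permutation test for significance. The paper's own analysis is not described on this page; the function names are hypothetical and the data below are made up purely for illustration.

```python
import random
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_perm: int = 2000, seed: int = 0) -> float:
    """Two-sided p-value: how often shuffled pairings match or exceed |rho|."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)
        if abs(spearman(x, y_shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Made-up paired measurements, one entry per image.
model_steps = [2, 3, 3, 5, 6, 8, 9, 12]
human_rt_ms = [410, 455, 430, 520, 500, 610, 640, 700]

rho = spearman(model_steps, human_rt_ms)
p = permutation_p_value(model_steps, human_rt_ms)
print(f"Spearman rho = {rho:.3f}, permutation p = {p:.4f}")
```

A rank correlation is used here because step counts are discrete and reaction times are typically skewed; controlling for task difficulty, as the comment requests, would need an additional covariate that this sketch does not model.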

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our normative framing. We address the two major points below and will revise the manuscript accordingly to strengthen the presentation of our cost functions and their justification.

Referee: [Abstract] The assertion that the framework supplies a 'normative account of how resource constraints shape neural architectures' is load-bearing for the central claim, yet the differentiable cost terms for breadth, depth, and time are selected for differentiability rather than derived from measured biological quantities (e.g., wiring length or metabolic rate); alternative exponents or normalizations could eliminate the reported trade-offs and spontaneous dynamics.

Authors: We agree that the cost terms (linear penalties on breadth and depth, and a step-count penalty on time) were selected to be differentiable so that resource usage can be optimized jointly with task loss via backpropagation. The normative claim refers to the principle that explicit, optimizable resource constraints can produce emergent architectural diversity and dynamics, rather than to the claim that our exact functional forms match measured biological quantities. We will revise the abstract to qualify the normative language and add a dedicated paragraph in the discussion that (i) states the rationale for the chosen forms as tractable approximations and (ii) reports new sensitivity analyses on exponents and normalizations. These analyses will test whether the reported trade-offs, growth patterns, and spontaneous recurrence survive changes in functional form.

Revision: yes

Referee: [Abstract] The resource pressure coefficients (lambdas) are free parameters; without an independent justification or sensitivity analysis showing that the growth patterns and human-RT correlation survive changes in functional form, the normative interpretation risks circularity.

Authors: The lambdas are hyperparameters that set the relative strength of each resource cost. Their specific values were selected so that networks reach high accuracy while still exhibiting measurable resource usage. We acknowledge that independent biological justification for particular lambda values is not provided. We will add a sensitivity section that sweeps lambda values over an order-of-magnitude range and tests alternative cost functional forms; we will show that the core phenomena (resource trade-offs for fixed accuracy, growth in all three dimensions with task complexity, increased recurrence under occlusion, and the model-time/human-RT correlation) remain qualitatively intact. This analysis will be reported in the revised manuscript to reduce the risk of circularity.

Revision: yes
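
A sweep of the kind promised in this response could be laid out as a log-spaced grid over the three coefficients. The sketch below only builds the grid; the center value of 0.1, the grid resolution, and the function names are assumptions for illustration, and training each configuration is left out because the paper's training code is not shown on this page.

```python
import itertools
from typing import Dict, List


def log_spaced(center: float, n_points: int = 5, decades: float = 1.0) -> List[float]:
    """`n_points` values spanning `decades` orders of magnitude around `center`."""
    if n_points == 1:
        return [center]
    half = decades / 2
    return [
        center * 10 ** (-half + decades * i / (n_points - 1))
        for i in range(n_points)
    ]


def lambda_grid(centers: Dict[str, float], n_points: int = 5,
                decades: float = 1.0) -> List[Dict[str, float]]:
    """Full factorial grid: every combination of per-resource lambda values."""
    names = list(centers)
    axes = [log_spaced(centers[name], n_points, decades) for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*axes)]


# Hypothetical centers; one order of magnitude per axis, three points each.
grid = lambda_grid({"breadth": 0.1, "depth": 0.1, "time": 0.1}, n_points=3)
print(len(grid), "configurations, e.g.", grid[0])
```

Each configuration would then be trained (ideally over several random seeds) and checked for the qualitative phenomena listed above; repeating the same grid with an alternative cost form would address the functional-form part of the comment.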

Circularity Check

0 steps flagged

No significant circularity; costs explicitly defined and results emerge from joint optimization with external validation

The paper defines differentiable cost terms for breadth, depth, and time, then optimizes them jointly with task loss via backpropagation on a recurrent conv net treated as a lattice subset. Emergent behaviors (resource trade-offs, growth with complexity, extra recurrent steps under occlusion, human RT correlation) are simulation outcomes rather than inputs by construction. No equations or self-citations reduce any central claim to a fitted parameter renamed as prediction or to a self-referential definition. The human RT correlation supplies an independent external benchmark. This is a standard normative modeling setup with chosen but transparent functional forms; no load-bearing step collapses to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

This ledger is based on the abstract only. The framework rests on the unstated premise that the chosen cost functions are appropriate normative models and that backpropagation through the recurrent lattice is sufficient to discover biologically plausible architectures. The abstract does not explicitly list free parameters, axioms, or invented entities; the entries below are read off from its statements.

free parameters (1)

resource pressure coefficients (lambdas for breadth, depth, time)
The abstract states that different pressures are set on the three resources; these scalar multipliers are chosen by the experimenter and directly control the emergent architectures.

axioms (2)

[standard math] Backpropagation can jointly optimize task loss and differentiable resource costs
Implicit in the statement that costs are optimized jointly with task errors via backpropagation.
[domain assumption] The recurrent convolutional network can be treated as a finite subset of an infinite lattice without loss of generality for the resource trade-offs
Stated in the abstract as the modeling choice.

invented entities (1)

differentiable cost terms for breadth, depth, and time [no independent evidence]
purpose: To penalize resource use so that networks grow organically under different pressures
These terms are defined by the authors and are central to the framework; no independent evidence outside the model is provided in the abstract.

Appendix A: Compute

Each model was trained on a single GPU for approximately 2.5 hours, requiring roughly 3.3 GB of GPU memory at batch size 128. Training was conducted on a university cluster with a mix ...

[1] [1]

Jascha Achterberg, Danyal Akarca, D. J. Strouse, John Duncan, and Duncan E. Astle. Spatially embedded recurrent neural networks reveal widespread links between structural and functional neuroscience findings. Nature Machine Intelligence, 5(12):1369–1381, November 2023. ISSN 2522-5839. doi: 10.1038/ s42256-023-00748-9. URLhttps://www.nature.com/articles/s4...

2023

[2] [2]

Predictive coding is a consequence of energy efficiency in recurrent neural networks.Patterns, 3(12), 2022

Abdullahi Ali, Nasir Ahmad, Elgar de Groot, Marcel Antonius Johannes van Gerven, and Tim Christian Kietzmann. Predictive coding is a consequence of energy efficiency in recurrent neural networks.Patterns, 3(12), 2022

2022

[3] [3]

Adaptive computation as a new mechanism of dynamic human attention.Psychological Review, 133(3):534, 2026

Mario Belledonne, Eivinas Butkus, Brian J Scholl, and Ilker Yildirim. Adaptive computation as a new mechanism of dynamic human attention.Psychological Review, 133(3):534, 2026

2026

[4] [4]

Nicholas M Blauch, Marlene Behrmann, and David C Plaut. A connectivity-constrained computational account of topographic organization in primate high-level visual cortex.Proceedings of the National Academy of Sciences, 119(3):e2112566119, 2022

2022

[5] [5]

How attention saves energy in vision.bioRxiv,

Eivinas Butkus, Zhuofan Ying, and Nikolaus Kriegeskorte. How attention saves energy in vision.bioRxiv,

[6] [6]

doi: 10.64898/2026.03.18.710397

work page doi:10.64898/2026.03.18.710397 2026

[7] [7]

Chen, David H

Beth L. Chen, David H. Hall, and Dmitri B. Chklovskii. Wiring optimization can relate neuronal structure and function.Proceedings of the National Academy of Sciences of the United States of America, 103(12): 4723–4728, March 2006. ISSN 0027-8424. doi: 10.1073/pnas.0506806103

work page doi:10.1073/pnas.0506806103 2006

[8] [8]

Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural Ordinary Differential Equations, December 2019. URLhttp://arxiv.org/abs/1806.07366. arXiv:1806.07366 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019

[9] [9]

Wiring optimization in cortical circuits

Dmitri B Chklovskii, Thomas Schikorski, and Charles F Stevens. Wiring optimization in cortical circuits. Neuron, 34(3):341–347, 2002

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848

A. Aldo Faisal, Luc P. J. Selen, and Daniel M. Wolpert. Noise in the nervous system. Nature Reviews Neuroscience, 9(4):292–303, April 2008. ISSN 1471-003X, 1471-0048. doi: 10.1038/nrn2258. URL https://www.nature.com/articles/nrn2258

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7

Robert Geirhos, Kristof Meding, and Felix A. Wichmann. Beyond accuracy: quantifying trial-by-trial behaviour of CNNs and humans by measuring error consistency. Advances in Neural Information Processing Systems, 33:13890–13902, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/hash/9f6992966d4c363ea0162a056cb45fe5-Abstract.html

Samuel J Gershman, Eric J Horvitz, and Joshua B Tenenbaum. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245):273–278, 2015

Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/file/ae0eb...

Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, February 2016. URL http://arxiv.org/abs/1510.00149. arXiv:1510.00149 [cs]

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

Mark K. Ho, David Abel, Carlos G. Correa, Michael L. Littman, Jonathan D. Cohen, and Thomas L. Griffiths. People construct simplified mental representations to plan. Nature, 606(7912):129–136, June 2022. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-022-04743-9. URL https://www.nature.com/articles/s41586-022-04743-9

Tim C. Kietzmann, Courtney J. Spoerer, Lynn K. A. Sörensen, Radoslaw M. Cichy, Olaf Hauk, and Nikolaus Kriegeskorte. Recurrence is required to capture the representational dynamics of the human visual system. Proceedings of the National Academy of Sciences, 116(43):21854–21863, October 2019. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1905544116. URL http...

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, Ontario, 2009

Simon B Laughlin and Terrence J Sejnowski. Communication in neuronal networks. Science, 301(5641):1870–1874, 2003

Gilles Laurent. On the value of model diversity in neuroscience. Nature Reviews Neuroscience, 21(8):395–396, 2020

Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015

Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning Filters for Efficient ConvNets, March 2017. URL http://arxiv.org/abs/1608.08710. arXiv:1608.08710 [cs]

Falk Lieder and Thomas L Griffiths. Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences, 43:e1, 2020

Jack Lindsey, Samuel A Ocko, Surya Ganguli, and Stephane Deny. A unified theory of early visual representations from retina to cortex through anatomically constrained deep CNNs. arXiv preprint arXiv:1901.00945, 2019

Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European conference on computer vision (ECCV), pages 19–34, 2018

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1eYHoC5FX

Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE international conference on computer vision, pages 2736–2744, 2017

Eshed Margalit, Hyodong Lee, Dawn Finzi, James J. DiCarlo, Kalanit Grill-Spector, and Daniel L.K. Yamins. A unifying framework for functional organization in early and higher ventral visual cortex. Neuron, 112(14):2435–2451.e7, July 2024. ISSN 08966273. doi: 10.1016/j.neuron.2024.04.018. URL https://linkinghub.elsevier.com/retrieve/pii/S0896627324002794

Joshua C Peterson, Ruairidh M Battleday, Thomas L Griffiths, and Olga Russakovsky. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9617–9626, 2019

Herbert A Simon. A behavioral model of rational choice. The quarterly journal of economics, pages 99–118, 1955

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

Courtney J. Spoerer, Patrick McClure, and Nikolaus Kriegeskorte. Recurrent Convolutional Neural Networks: A Better Model of Biological Object Recognition. Frontiers in Psychology, 8:1551, September 2017. ISSN 1664-1078. doi: 10.3389/fpsyg.2017.01551. URL https://www.frontiersin.org/article/10.3389/fpsyg.2017.01551/full

Courtney J Spoerer, Tim C Kietzmann, Johannes Mehrer, Ian Charest, and Nikolaus Kriegeskorte. Recurrent neural networks can explain flexible trading of speed and accuracy in biological vision. PLOS Computational Biology, 16(10):e1008215, 2020

Peter Sterling and Simon Laughlin. Principles of neural design. MIT Press, 2015

Simon Thorpe, Denis Fize, and Catherine Marlot. Speed of processing in the human visual system. Nature, 381(6582):520–522, 1996

Edward Vul, Noah Goodman, Thomas L Griffiths, and Joshua B Tenenbaum. One and done? Optimal decisions from very few samples. Cognitive science, 38(4):599–637, 2014

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.

A Compute

Each model was trained on a single GPU for approximately 2.5 hours, requiring roughly 3.3 GB of GPU memory at batch size 128. Training was conducted on a university cluster with a mix ...
