Convolutional Reservoir Computing for World Models
Pith reviewed 2026-05-24 19:43 UTC · model grok-4.3
The pith
A reinforcement learning model using random fixed-weight convolutional and reservoir layers achieves state-of-the-art scores without training those layers or storing data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The RCRC model extracts visual and time-series features quickly because it uses a random fixed-weight CNN and a reservoir computing model, neither of which is trained. It does not need to store training data, because features are extracted without training and actions are decided with an evolution strategy. The paper further reports a state-of-the-art score on a popular reinforcement learning task, and finds that even simple random fixed-weight networks, such as a single dense layer, can reach a high score on that task.
What carries the argument
Convolutional reservoir computing (RCRC) with random fixed-weight CNN and reservoir layers, paired with evolution strategy for action selection.
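The reservoir half of this pipeline can be illustrated with a minimal sketch. It assumes a generic leaky echo-state update with a tanh nonlinearity; the class name `FixedReservoir`, the `leak` parameter, and the weight scaling are illustrative choices, not details taken from the paper. What it shows is that both weight matrices are drawn once and never updated.

```python
import math
import random


def random_matrix(rows, cols, scale, rng):
    """Draw a rows x cols matrix with entries uniform in [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class FixedReservoir:
    """Echo-state-style reservoir whose weights are random and never trained."""

    def __init__(self, n_inputs, n_units, leak=0.5, seed=0):
        rng = random.Random(seed)
        # Both matrices are sampled once here and stay frozen afterwards.
        self.w_in = random_matrix(n_units, n_inputs, 1.0, rng)
        # Hypothetical scaling choice to keep recurrent activity bounded.
        self.w = random_matrix(n_units, n_units, 1.0 / math.sqrt(n_units), rng)
        self.leak = leak
        self.state = [0.0] * n_units

    def step(self, u):
        """Mix the new input with the previous state; no weights change."""
        drive = [a + b for a, b in zip(matvec(self.w_in, u), matvec(self.w, self.state))]
        self.state = [
            (1.0 - self.leak) * s + self.leak * math.tanh(d)
            for s, d in zip(self.state, drive)
        ]
        return self.state


reservoir = FixedReservoir(n_inputs=3, n_units=8, seed=1)
for observation in ([0.1, 0.2, 0.3], [0.0, -0.1, 0.4]):
    features = reservoir.step(observation)
print(len(features))  # prints 8
```

Because the weights are seeded and frozen, two reservoirs built with the same seed produce identical state sequences, so only a downstream controller would need to be optimized.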
If this is right
- Feature extraction occurs without training the CNN or reservoir layers.
- Past playing data does not need to be stored.
- The model reaches a state-of-the-art score on the popular RL task on which it was tested.
- Simple fixed-weight networks consisting of only one dense layer perform well on these tasks.
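The point about not storing past playing data follows from how an evolution strategy operates: each candidate parameter vector is scored by its return and then discarded. The sketch below uses a simple elite-averaging Gaussian evolution strategy, which is an assumption made for illustration and may differ from the strategy the paper uses. The function `toy_return` is a hypothetical stand-in for an episode return, and all names and hyperparameters are invented for the example.

```python
import random


def toy_return(params):
    """Hypothetical stand-in for an episode return: peaks at a fixed target."""
    target = [1.0, -2.0, 0.5]
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def evolve(fitness, dim, generations=60, population=20, elite=5, sigma=0.3, seed=0):
    """Elite-averaging Gaussian evolution strategy; keeps no replay buffer."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    best_params, best_score = list(mean), fitness(mean)
    for _ in range(generations):
        # Sample candidates around the current mean.
        candidates = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean] for _ in range(population)
        ]
        # Score each candidate; only the scores and parameters are kept.
        scored = sorted(
            ((fitness(c), c) for c in candidates), key=lambda pair: pair[0], reverse=True
        )
        top = [c for _, c in scored[:elite]]
        # Move the mean to the average of the elite candidates.
        mean = [sum(c[i] for c in top) / elite for i in range(dim)]
        if scored[0][0] > best_score:
            best_score, best_params = scored[0]
    return best_params, best_score


params, score = evolve(toy_return, dim=3)
print(score >= toy_return([0.0, 0.0, 0.0]))  # prints True
```

In an RCRC-style setup, `params` would be the weights of a small controller reading the fixed features, and `fitness` would run one or more episodes and return the score.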
Where Pith is reading between the lines
- This method could reduce computational resources required for RL in visual environments by eliminating weight training in the feature extractors.
- It suggests that many control tasks may not require learned features and that random projections can suffice in the tested settings.
- Fixed-weight reservoir approaches might extend to other sequential decision problems if the environments share similar visual and temporal structure.
Load-bearing premise
Random fixed weights in the CNN and reservoir computing layers are sufficient to extract task-relevant visual and temporal features for the RL environments tested, without any training or adaptation of those weights.
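To make the premise concrete for the visual side, here is a minimal sketch of feature extraction with untrained convolution kernels. It assumes a single convolution layer with ReLU and global max pooling; the helper names `make_fixed_kernels` and `conv_features`, the kernel count, and the kernel size are hypothetical and are not the paper's architecture.

```python
import random


def make_fixed_kernels(n_kernels, size, seed=0):
    """Sample square kernels once from a Gaussian; they are never updated."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]
        for _ in range(n_kernels)
    ]


def conv_features(image, kernels):
    """Valid convolution, ReLU, then global max pooling: one feature per kernel."""
    height, width = len(image), len(image[0])
    features = []
    for kernel in kernels:
        size = len(kernel)
        best = 0.0  # starting at zero applies the ReLU floor
        for r in range(height - size + 1):
            for c in range(width - size + 1):
                activation = sum(
                    kernel[i][j] * image[r + i][c + j]
                    for i in range(size)
                    for j in range(size)
                )
                best = max(best, activation)
        features.append(best)
    return features


kernels = make_fixed_kernels(n_kernels=4, size=3, seed=0)
# Toy 5x5 "frame" with a single bright pixel in the centre.
frame = [[0.0] * 5 for _ in range(5)]
frame[2][2] = 1.0
print(len(conv_features(frame, kernels)))  # prints 4
```

Whether features like these carry enough task-relevant information is exactly the question the premise leaves open; the sketch only shows that producing them requires no training step.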
What would settle it
A direct comparison on the same RL task showing that a version with trained CNN and reservoir weights significantly outperforms the fixed random version or that the fixed version falls below competitive scores.
Original abstract
Recently, reinforcement learning models have achieved great success, completing complex tasks such as mastering Go and other games with higher scores than human players. Many of these models collect considerable data on the tasks and improve accuracy by extracting visual and time-series features using convolutional neural networks (CNNs) and recurrent neural networks, respectively. However, these networks have very high computational costs because they need to be trained by repeatedly using a large volume of past playing data. In this study, we propose a novel practical approach called reinforcement learning with convolutional reservoir computing (RCRC) model. The RCRC model has several desirable features: 1. it can extract visual and time-series features very fast because it uses random fixed-weight CNN and the reservoir computing model; 2. it does not require the training data to be stored because it extracts features without training and decides action with evolution strategy. Furthermore, the model achieves state of the art score in the popular reinforcement learning task. Incredibly, we find the random weight-fixed simple networks like only one dense layer network can also reach high score in the RL task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the RCRC model for reinforcement learning, which extracts visual features via random fixed-weight CNN layers and temporal features via reservoir computing, then uses an evolution strategy to select actions. It claims this avoids the need to store or repeatedly train on large volumes of past data, achieves state-of-the-art scores on popular RL tasks, and that even a single random dense layer can reach high performance.
Significance. If the empirical results hold with proper controls, the work would be significant for demonstrating that untrained random projections can suffice for competitive RL performance, substantially lowering computational cost and memory requirements compared to trained CNN/RNN feature extractors. The data-free aspect and the surprising efficacy of minimal random networks would be notable contributions to efficient world-model approaches in RL.
major comments (2)
- [Experiments section (inferred from abstract claims)] The central empirical claim (SOTA performance via untrained random CNN and reservoir layers) is load-bearing yet unsupported by any analysis of why the particular random initialization succeeds; no feature visualizations, ablation on reservoir spectral radius, or comparison against trained CNN baselines appear to be present to address the weakest assumption that random fixed weights extract task-relevant features.
- [Abstract and results claims] The assertion that 'only one dense layer network can also reach high score' requires quantitative evidence (e.g., scores, baselines, variance) to be load-bearing; without reported benchmark names, error bars, or statistical tests, the claim that random fixed networks match trained models cannot be evaluated.
minor comments (1)
- [Abstract] The abstract would benefit from naming the specific RL environments, baselines, and quantitative scores to allow immediate assessment of the SOTA claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the presentation of the empirical results.
- Referee: [Experiments section (inferred from abstract claims)] The central empirical claim (SOTA performance via untrained random CNN and reservoir layers) is load-bearing yet unsupported by any analysis of why the particular random initialization succeeds; no feature visualizations, ablation on reservoir spectral radius, or comparison against trained CNN baselines appear to be present to address the weakest assumption that random fixed weights extract task-relevant features.
  Authors: We agree that the manuscript would benefit from additional analyses to better support the assumption that random fixed weights extract relevant features. The current work emphasizes the practical advantages and observed performance, but we will add feature visualizations, an ablation study on the reservoir spectral radius, and comparisons against trained CNN baselines in the revised version.
  Revision: yes
- Referee: [Abstract and results claims] The assertion that 'only one dense layer network can also reach high score' requires quantitative evidence (e.g., scores, baselines, variance) to be load-bearing; without reported benchmark names, error bars, or statistical tests, the claim that random fixed networks match trained models cannot be evaluated.
  Authors: The manuscript reports results on standard reinforcement learning benchmarks, but we acknowledge that more detailed quantitative support (explicit benchmark names, scores with error bars, variance across runs, and statistical comparisons) would make the claim more readily evaluable. We will expand the results section with these elements in the revision.
  Revision: yes
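The kind of reporting requested here (mean score, variance across runs, and an error bar) can be sketched in a few lines. The helper name `summarize_runs` is hypothetical, and the scores below are placeholder numbers chosen for easy arithmetic; they are not results from the paper.

```python
import math
import statistics


def summarize_runs(scores):
    """Return mean, sample standard deviation, and standard error of the mean."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    sem = std / math.sqrt(len(scores))
    return mean, std, sem


# Placeholder values only; real use would pass one final score per independent run.
placeholder_scores = [10.0, 12.0, 14.0]
mean, std, sem = summarize_runs(placeholder_scores)
print(f"{mean:.1f} +/- {sem:.2f} (std {std:.1f}, n={len(placeholder_scores)})")
```

Reporting the number of runs alongside the standard error lets a reader judge whether a gap between the fixed-weight model and a trained baseline exceeds run-to-run noise.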
Circularity Check
Empirical proposal with no derivation chain or fitted predictions
The paper presents an empirical model (random fixed-weight CNN + reservoir computing + evolution strategy) and reports experimental RL scores. No equations, derivations, or first-principles results appear; claims are not quantities defined in terms of fitted parameters, self-citations, or ansatzes that reduce to inputs by construction. The central assertion (untrained random weights suffice for SOTA) is an empirical hypothesis tested on environments, not a self-referential prediction. This matches the default case of a self-contained empirical result with no circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Random fixed weights in CNN and reservoir layers extract task-relevant features without training