
arxiv: 1907.08040 · v1 · pith:GDCBZB2Rnew · submitted 2019-07-18 · 💻 cs.LG · cs.NE · stat.ML

Convolutional Reservoir Computing for World Models

Pith reviewed 2026-05-24 19:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.NE · stat.ML
keywords reinforcement learning · reservoir computing · convolutional neural networks · evolution strategy · fixed random weights · feature extraction · world models

The pith

A reinforcement learning model using random fixed-weight convolutional and reservoir layers achieves state-of-the-art scores without training those layers or storing data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the RCRC model for reinforcement learning that relies on random fixed-weight CNNs to extract visual features and reservoir computing for time-series features. These components do not require training, allowing the model to process data quickly and avoid storing large volumes of past playing data. Actions are decided using an evolution strategy. The approach reaches state-of-the-art performance on a popular RL task. Even simpler networks with only one dense layer and fixed random weights can achieve high scores.

Core claim

The RCRC model extracts visual and time-series features quickly because it uses a random fixed-weight CNN and a reservoir computing model. It does not need to store training data because it extracts features without training and selects actions with an evolution strategy. The model also achieves a state-of-the-art score on a popular reinforcement learning task. Even simple networks with fixed random weights, such as a network with only one dense layer, can reach a high score on the RL task.
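To make the "decides action with evolution strategy" step concrete, here is a minimal sketch of an evolution strategy that searches over a weight vector without gradients or stored training data. It is a plain Gaussian elite-averaging strategy used as a stand-in, not CMA-ES and not necessarily the variant the authors use. The function name `simple_es`, the toy objective, and all hyperparameters are assumptions made for illustration.

```python
import numpy as np

def simple_es(fitness, dim, n_candidates=16, n_steps=100, sigma=0.3, lr=0.5, seed=0):
    """Plain Gaussian evolution strategy (illustrative stand-in, not CMA-ES).

    n_steps plays the role of the number of update steps and n_candidates the
    number of candidate solutions sampled per step. All defaults are assumed.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    n_elite = max(1, n_candidates // 2)
    for _ in range(n_steps):
        # Sample candidate weight vectors around the current mean.
        candidates = mean + sigma * rng.normal(size=(n_candidates, dim))
        scores = np.array([fitness(c) for c in candidates])
        # Keep the better half and move the mean toward their average.
        elite = candidates[np.argsort(scores)[-n_elite:]]
        mean = mean + lr * (elite.mean(axis=0) - mean)
    return mean

# Toy objective (hypothetical): higher is better, maximum at `target`.
target = np.array([1.0, -2.0, 0.5])

def toy_fitness(w):
    return -float(np.sum((w - target) ** 2))

best = simple_es(toy_fitness, dim=3)
```

In an RL setting, `fitness` would be replaced by the episode score obtained when the candidate weights are used as the output layer, so only scores, not trajectories, need to be kept between steps.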

What carries the argument

Reinforcement learning with convolutional reservoir computing (RCRC): random fixed-weight CNN and reservoir layers, paired with an evolution strategy for action selection.
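The pipeline described above can be sketched roughly as follows: a fixed random convolution produces visual features, a fixed random reservoir accumulates them over time, and only a linear readout `w_out` is left to be optimized. This is a sketch under stated assumptions, not the authors' implementation. The layer sizes, weight scales, `tanh` activations, the spectral radius of 0.9, and the choice to concatenate features, reservoir state, and a bias term are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and scales (assumptions, not the paper's settings).
FRAME, N_FILTERS, KERNEL = 64, 4, 8          # stride equals kernel size here
N_RESERVOIR, N_ACTIONS = 64, 3
N_FEATURES = N_FILTERS * (FRAME // KERNEL) ** 2

# Fixed random weights: sampled once from a Gaussian and never updated.
conv_w = rng.normal(0.0, 0.1, size=(N_FILTERS, KERNEL, KERNEL))
w_in = rng.normal(0.0, 0.1, size=(N_RESERVOIR, N_FEATURES))
w_res = rng.normal(0.0, 1.0, size=(N_RESERVOIR, N_RESERVOIR))
# Rescale the recurrent matrix to an assumed spectral radius of 0.9.
w_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(w_res)))

def conv_features(frame):
    """Apply the fixed random filters to non-overlapping patches of a 2-D frame."""
    rows, cols = frame.shape[0] // KERNEL, frame.shape[1] // KERNEL
    patches = frame[:rows * KERNEL, :cols * KERNEL]
    patches = patches.reshape(rows, KERNEL, cols, KERNEL).transpose(0, 2, 1, 3)
    out = np.einsum("rcij,fij->frc", patches, conv_w)
    return np.tanh(out).ravel()

def reservoir_step(state, features):
    """One echo-state update with fixed input and recurrent weights."""
    return np.tanh(w_in @ features + w_res @ state)

def act(w_out, state, features):
    """Linear readout; w_out is the only part an optimizer would change."""
    readout_in = np.concatenate([features, state, [1.0]])
    return np.tanh(w_out @ readout_in)

# One step on a synthetic frame (random pixels stand in for an observation).
w_out = np.zeros((N_ACTIONS, N_FEATURES + N_RESERVOIR + 1))
state = np.zeros(N_RESERVOIR)
feats = conv_features(rng.random((FRAME, FRAME)))
state = reservoir_step(state, feats)
action = act(w_out, state, feats)
```

Because `conv_w`, `w_in`, and `w_res` never change, the only search problem left is over the entries of `w_out`, which is small enough for a gradient-free optimizer.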

If this is right

  • Feature extraction occurs without training the CNN or reservoir layers.
  • Past playing data does not need to be stored.
  • The model reaches a state-of-the-art score on the RL task tested (CarRacing-v0, per the figure captions).
  • Simple fixed-weight networks consisting of only one dense layer perform well on these tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could reduce computational resources required for RL in visual environments by eliminating weight training in the feature extractors.
  • It suggests that many control tasks may not require learned features and that random projections can suffice in the tested settings.
  • Fixed-weight reservoir approaches might extend to other sequential decision problems if the environments share similar visual and temporal structure.

Load-bearing premise

Random fixed weights in the CNN and reservoir computing layers are sufficient to extract task-relevant visual and temporal features for the RL environments tested, without any training or adaptation of those weights.

What would settle it

A direct comparison on the same RL task showing that a version with trained CNN and reservoir weights significantly outperforms the fixed random version or that the fixed version falls below competitive scores.
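A comparison of this kind would typically report per-seed scores for each variant and a test of the difference. The sketch below computes a mean, a sample standard deviation, and Welch's t statistic with its approximate degrees of freedom using only NumPy. The score lists are placeholders chosen for arithmetic clarity, not results from the paper, and the function names are hypothetical.

```python
import numpy as np

def summarize(scores):
    """Return (mean, sample standard deviation) of a list of per-seed scores."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se_a, se_b = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (a.size - 1) + se_b ** 2 / (b.size - 1))
    return t, dof

# Placeholder scores only (NOT from the paper): one value per random seed.
variant_a = [10.0, 12.0, 11.0, 13.0]   # e.g. trained feature extractors
variant_b = [9.0, 10.0, 8.0, 11.0]     # e.g. fixed random feature extractors

mean_a, std_a = summarize(variant_a)
mean_b, std_b = summarize(variant_b)
t_stat, dof = welch_t(variant_a, variant_b)
```

Welch's test is chosen here because it does not assume equal variances across the two variants; with only a handful of seeds, the resulting statistic should be read cautiously.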

Figures

Figures reproduced from arXiv: 1907.08040 by Hanten Chang, Katsuya Futagami.

Figure 1: Reservoir Computing overview for the time-series prediction task.
Figure 2: RCRC overview to choose the action for CarRacing-v0: the first and second layers are collectively called the convolutional reservoir computing layer, and both layers' model weights are sampled from a Gaussian distribution and then fixed. […] transformation of these features. This implies that it only requires features that sufficiently express the environment state, rather than features trained to solve the task…
Figure 3: Example environment state image of CarRacing-v0 and three parameters in the environment. The score is added when the car passes through a tile laid on the course. In this process, T represents an update step of the weight matrix Wout, and n is the number of solution candidates Wout generated at each step. The worker is an agent that implements RCRC, and each worker extracts features, takes the action and p…
Figure 4: The best average score over 8 randomly created tracks among 16 workers at
Original abstract

Recently, reinforcement learning models have achieved great success, completing complex tasks such as mastering Go and other games with higher scores than human players. Many of these models collect considerable data on the tasks and improve accuracy by extracting visual and time-series features using convolutional neural networks (CNNs) and recurrent neural networks, respectively. However, these networks have very high computational costs because they need to be trained by repeatedly using a large volume of past playing data. In this study, we propose a novel practical approach called reinforcement learning with convolutional reservoir computing (RCRC) model. The RCRC model has several desirable features: 1. it can extract visual and time-series features very fast because it uses random fixed-weight CNN and the reservoir computing model; 2. it does not require the training data to be stored because it extracts features without training and decides action with evolution strategy. Furthermore, the model achieves state of the art score in the popular reinforcement learning task. Incredibly, we find the random weight-fixed simple networks like only one dense layer network can also reach high score in the RL task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes the RCRC model for reinforcement learning, which extracts visual features via random fixed-weight CNN layers and temporal features via reservoir computing, then uses an evolution strategy to select actions. It claims this avoids the need to store or repeatedly train on large volumes of past data, achieves a state-of-the-art score on a popular RL task, and that even a single random dense layer can reach high performance.

Significance. If the empirical results hold with proper controls, the work would be significant for demonstrating that untrained random projections can suffice for competitive RL performance, substantially lowering computational cost and memory requirements compared to trained CNN/RNN feature extractors. The data-free aspect and the surprising efficacy of minimal random networks would be notable contributions to efficient world-model approaches in RL.

major comments (2)
  1. [Experiments section (inferred from abstract claims)] The central empirical claim (SOTA performance via untrained random CNN and reservoir layers) is load-bearing yet unsupported by any analysis of why the particular random initialization succeeds; no feature visualizations, ablation on reservoir spectral radius, or comparison against trained CNN baselines appear to be present to address the weakest assumption that random fixed weights extract task-relevant features.
  2. [Abstract and results claims] The assertion that 'only one dense layer network can also reach high score' requires quantitative evidence (e.g., scores, baselines, variance) to be load-bearing; without reported benchmark names, error bars, or statistical tests, the claim that random fixed networks match trained models cannot be evaluated.
minor comments (1)
  1. [Abstract] The abstract would benefit from naming the specific RL environments, baselines, and quantitative scores to allow immediate assessment of the SOTA claim.
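The spectral-radius ablation requested in major comment 1 could be set up along the following lines. The sketch rescales a random recurrent matrix to a chosen spectral radius and then checks how quickly two different initial reservoir states converge when driven by the same input, a common informal probe of the echo state property. The function names, the reservoir size, the input scaling, and the swept values are assumptions for illustration and do not come from the paper.

```python
import numpy as np

def scaled_reservoir(n_units, spectral_radius, rng):
    """Random Gaussian recurrent matrix rescaled to the requested spectral radius."""
    w = rng.normal(size=(n_units, n_units))
    radius = np.max(np.abs(np.linalg.eigvals(w)))
    return w * (spectral_radius / radius)

def final_state_gap(w, n_steps=200, seed=1):
    """Drive two different initial states with the same input; return their final distance."""
    rng = np.random.default_rng(seed)
    n = w.shape[0]
    w_in = rng.normal(0.0, 0.5, size=n)
    x_a, x_b = rng.normal(size=n), rng.normal(size=n)
    for _ in range(n_steps):
        u = rng.uniform(-1.0, 1.0)          # shared scalar input signal
        x_a = np.tanh(w @ x_a + w_in * u)
        x_b = np.tanh(w @ x_b + w_in * u)
    return float(np.linalg.norm(x_a - x_b))

# Sweep a few hypothetical spectral radii and record the remaining gap.
gaps = {}
for rho in (0.5, 0.9, 1.5):
    w = scaled_reservoir(50, rho, np.random.default_rng(0))
    gaps[rho] = final_state_gap(w)
    print(f"spectral radius {rho}: final state gap {gaps[rho]:.3e}")
```

A small gap indicates that the reservoir forgets its initial condition and its state is determined by the input history. An actual ablation would replace the gap with the RL score obtained at each spectral radius.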

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the presentation of the empirical results.

  1. Referee: [Experiments section (inferred from abstract claims)] The central empirical claim (SOTA performance via untrained random CNN and reservoir layers) is load-bearing yet unsupported by any analysis of why the particular random initialization succeeds; no feature visualizations, ablation on reservoir spectral radius, or comparison against trained CNN baselines appear to be present to address the weakest assumption that random fixed weights extract task-relevant features.

    Authors: We agree that the manuscript would benefit from additional analyses to better support the assumption that random fixed weights extract relevant features. The current work emphasizes the practical advantages and observed performance, but we will add feature visualizations, an ablation study on the reservoir spectral radius, and comparisons against trained CNN baselines in the revised version.
    Revision: yes

  2. Referee: [Abstract and results claims] The assertion that 'only one dense layer network can also reach high score' requires quantitative evidence (e.g., scores, baselines, variance) to be load-bearing; without reported benchmark names, error bars, or statistical tests, the claim that random fixed networks match trained models cannot be evaluated.

    Authors: The manuscript reports results on standard reinforcement learning benchmarks, but we acknowledge that more detailed quantitative support (explicit benchmark names, scores with error bars, variance across runs, and statistical comparisons) would make the claim easier to evaluate. We will expand the results section with these elements in the revision.
    Revision: yes

Circularity Check

0 steps flagged

Empirical proposal with no derivation chain or fitted predictions


The paper presents an empirical model (random fixed-weight CNN + reservoir computing + evolution strategy) and reports experimental RL scores. No equations, derivations, or first-principles results appear; claims are not quantities defined in terms of fitted parameters, self-citations, or ansatzes that reduce to inputs by construction. The central assertion (untrained random weights suffice for SOTA) is an empirical hypothesis tested on environments, not a self-referential prediction. This matches the default case of a self-contained empirical result with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; the central claim rests on the domain assumption that untrained random networks extract useful features and on the implicit modeling choice that evolution strategies suffice for policy search in the tested environments. No free parameters or invented entities are identifiable from the abstract.

axioms (1)
  • domain assumption: Random fixed weights in CNN and reservoir layers extract task-relevant features without training
    The model is built on this premise to avoid training the feature extractors.


