
arxiv: 2605.23689 · v1 · pith:7WLUHCLMnew · submitted 2026-05-22 · 💻 cs.LG · math.DS

Optimization of randomized neural networks for transfer operator approximation

Pith reviewed 2026-05-25 05:07 UTC · model grok-4.3

classification 💻 cs.LG math.DS
keywords randomized neural networks · transfer operators · activation function optimization · dynamical systems · data-driven approximation · RaNNDy · stochastic differential equations

The pith

Optimizing the activation function alone in randomized neural networks produces better dictionaries for transfer operator approximation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

RaNNDy approximates transfer operators of dynamical systems by initializing the hidden-layer weights and biases randomly, keeping them fixed, and training only the output layer, which admits a closed-form solution and keeps training cost low. The basis functions that form the dictionary are shaped by the choice of activation function, so a poor initial choice limits the approximation quality. The paper presents an algorithm that tunes the activation function itself while leaving the random weights and biases unchanged, with the aim of constructing a more suitable dictionary. The method is demonstrated on stochastic differential equations and random walks on graphons.
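
The fixed-hidden-layer idea can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the Ornstein-Uhlenbeck test data, the added constant feature, the plain least-squares (EDMD-style) solve, and all names (`dictionary`, `W`, `b`, `K`) are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): Ornstein-Uhlenbeck process dX = -X dt + sqrt(2) dW,
# sampled exactly at lag tau, starting from its stationary distribution N(0, 1).
tau, m = 0.5, 5000
X = rng.normal(size=(m, 1))
Y = np.exp(-tau) * X + np.sqrt(1.0 - np.exp(-2.0 * tau)) * rng.normal(size=(m, 1))

# Hidden layer: drawn once and never trained.
N = 10
W = rng.normal(size=(1, N))
b = rng.normal(size=(1, N))


def dictionary(Z, activation=np.tanh):
    """Evaluate a constant plus the basis functions sigma(w_i * z + b_i)."""
    return np.hstack([np.ones((len(Z), 1)), activation(Z @ W + b)])


PX, PY = dictionary(X), dictionary(Y)

# Output layer: closed-form least-squares solution of PX @ K ~ PY.
K, *_ = np.linalg.lstsq(PX, PY, rcond=None)

# Eigenvalues of K approximate Koopman eigenvalues; for this OU process the
# exact values are exp(-k * tau), k = 0, 1, 2, ...
eigvals = np.linalg.eigvals(K)
eigvals = eigvals[np.argsort(-np.abs(eigvals))]
print("estimated magnitudes:", np.round(np.abs(eigvals[:4]), 3))
print("exact values:        ", np.round(np.exp(-tau * np.arange(4)), 3))
```

Only the final solve depends on the data; changing `activation` changes every column of the dictionary at once, which is why the choice of activation matters for a fixed draw of `W` and `b`.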

Core claim

By optimizing the activation function of a randomized neural network while keeping its randomly initialized weights and biases fixed, a more suitable dictionary can be obtained for the data-driven approximation of transfer operators associated with complex dynamical systems.

What carries the argument

An algorithm that optimizes the activation function in RaNNDy while keeping randomly initialized weights and biases fixed.
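
A rough sketch of what tuning the activation with frozen weights can look like, under stated assumptions: the activation is parametrized as tanh(ω z) (the Figure 2 caption mentions a tanh activation parameter ω, but its exact form is not given here), the objective is an in-sample VAMP-2 score, and a grid search stands in for whatever optimizer the paper uses. The double-well test system and the names `simulate_pairs`, `inv_sqrt`, and `vamp2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)


def simulate_pairs(m, steps=50, dt=1e-2):
    """Euler-Maruyama pairs (x, y) for dX = -V'(X) dt + sqrt(2) dW, V(x) = (x^2 - 1)^2."""
    X = rng.uniform(-2.0, 2.0, size=(m, 1))
    Y = X.copy()
    for _ in range(steps):
        Y = Y - 4.0 * Y * (Y**2 - 1.0) * dt + np.sqrt(2.0 * dt) * rng.normal(size=Y.shape)
    return X, Y


def inv_sqrt(C, eps=1e-6):
    """Factor L with L.T @ C @ L = I on the eigen-directions kept after truncation."""
    s, U = np.linalg.eigh(C)
    keep = s > eps * s.max()
    return U[:, keep] / np.sqrt(s[keep])


def vamp2(PX, PY):
    """Squared Frobenius norm of the whitened cross-covariance matrix."""
    m = len(PX)
    C00, C01, C11 = PX.T @ PX / m, PX.T @ PY / m, PY.T @ PY / m
    return np.linalg.norm(inv_sqrt(C00).T @ C01 @ inv_sqrt(C11), "fro") ** 2


X, Y = simulate_pairs(4000)

# Random hidden layer: drawn once, identical for every omega.
N = 15
W = rng.normal(size=(1, N))
b = rng.normal(size=(1, N))


def dictionary(Z, omega):
    """Constant plus tanh(omega * (w_i * z + b_i)); only omega is tuned."""
    return np.hstack([np.ones((len(Z), 1)), np.tanh(omega * (Z @ W + b))])


# omega = 1 (plain tanh) is placed first so it can serve as the baseline.
omegas = np.concatenate(([1.0], np.logspace(-1, 1, 20)))
scores = np.array([vamp2(dictionary(X, w), dictionary(Y, w)) for w in omegas])
best = int(np.argmax(scores))
print(f"omega = 1.000: score {scores[0]:.4f}")
print(f"omega = {omegas[best]:.3f}: score {scores[best]:.4f} (best on grid)")
```

The score is evaluated on the training pairs, so it can favor dictionaries with higher numerical rank; a held-out score would be the safer selection criterion in practice.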

If this is right

  • The closed-form training of the output layer and low overall training cost of RaNNDy are preserved.
  • Improved dictionaries become available for approximating transfer operators of stochastic differential equations.
  • The same fixed-weight network can be adapted to random walks on graphons without retraining hidden-layer parameters.
  • Dictionary quality improves without the computational expense of fully optimizing all network weights.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same activation-tuning step could be inserted into other randomized architectures that rely on fixed hidden layers.
  • Sensitivity of approximation quality to the initial random draw of weights may decrease once the activation is free to adjust.
  • A two-stage procedure emerges in which randomization sets the scale and activation tuning refines the shape of the basis.

Load-bearing premise

Adjusting only the activation function is enough to overcome the restriction that fixed random weights and biases place on the basis functions.

What would settle it

A benchmark dynamical system on which the optimized activation function produces no reduction in approximation error compared with standard choices would falsify the central efficacy claim.

Figures

Figures reproduced from arXiv: 2605.23689 by Mohammad Tabish, Stefan Klus.

Figure 1
(a) Symmetric graphon g. (b) The associated transition density function p. (c) A random walk on the graphon.

[...] where φ_i is an eigenfunction of the Koopman operator and φ̂_i = π φ_i the corresponding eigenfunction of the Perron–Frobenius operator. This also allows us to reconstruct the transition probability density p(x, y) and the graphon g(x, y), i.e., p(x, y) = Σ_i λ_i φ_i(x) φ̂_i(y) and g(x, y) = Z Σ_i λ_i φ̂_i(x) φ̂…
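
The reconstruction identities quoted in the Figure 1 text have an exact finite-state analogue that can be checked numerically. The sketch below replaces the graphon by a symmetric weight matrix `G` on six nodes, an assumption made purely for illustration, and treats Z as the total weight (normalization constant). It also assumes the truncated formula for g continues as Z Σ_i λ_i φ̂_i(x) φ̂_i(y), which is what the finite-state calculation yields.

```python
import numpy as np

rng = np.random.default_rng(3)

# Finite-state stand-in for a symmetric graphon: symmetric positive weights.
n = 6
A = rng.uniform(0.1, 1.0, size=(n, n))
G = (A + A.T) / 2.0

d = G.sum(axis=1)        # degrees
Z = d.sum()              # normalization constant
pi = d / Z               # invariant distribution of the random walk
P = G / d[:, None]       # row-stochastic transition matrix

# The symmetrized matrix D^{-1/2} G D^{-1/2} has real eigenpairs.
S = G / np.sqrt(np.outer(d, d))
lam, U = np.linalg.eigh(S)

# Columns of phi: Koopman eigenvectors, normalized in the pi-weighted inner product.
phi = np.sqrt(Z) * U / np.sqrt(d)[:, None]
# Columns of phi_hat: Perron-Frobenius eigenvectors, phi_hat_i = pi * phi_i.
phi_hat = pi[:, None] * phi

# p(x, y) = sum_i lam_i phi_i(x) phi_hat_i(y)
P_rec = (phi * lam) @ phi_hat.T
# g(x, y) = Z sum_i lam_i phi_hat_i(x) phi_hat_i(y)
G_rec = Z * (phi_hat * lam) @ phi_hat.T

print("max error in P:", np.abs(P_rec - P).max())
print("max error in G:", np.abs(G_rec - G).max())
```

Truncating both sums to the dominant eigenvalues gives low-rank reconstructions in the same spirit as the rank-3 graphon mentioned in the Figure 2 caption.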
Figure 2
(a) Training loss for the optimization of RaNNDy with the tanh activation parameter ω. (b) Five dominant eigenvalues of the Koopman operator after training. (c) Three dominant associated Koopman operator eigenfunctions, where separate line markers denote the first, the second, and the third eigenfunction. (d) The associated Perron–Frobenius operator eigenfunctions. (e) Reconstructed graphon g with rank 3. (f) Corresponding tra…
Figure 3
Selecting the distributions for the hidden layers' weights and biases in RaNNDy. (a) Grid search for different distributions. (b) Loss surface for different values of scales of the normal distributions. (c) Comparison of training the whole network (VAMPnets with learning rate lr = 10^{-2}) vs. only the scales of the distributions for the initializations of RaNNDy. We can see that the optimizer is stuck in a …
Figure 4
Bickley jet flow at time (a) t = 0, (b) t = 25, and (c) t = 50.
Figure 5
Optimization of the activation function for the Bickley jet. (a) The loss function and the activation parameter values. (b) Eigenvalues before and after optimization. (c) Clustering of the initial nine dominant singular functions into nine clusters. (d) Optimized clustering.
Figure 6
(a) Ten dominant eigenvalues of the approximated Koopman operator. (b) The second dominant eigenfunction to distinguish the folded and unfolded states, where one marker denotes the initial and another the optimized eigenfunction evaluated at the data points. (c) & (d) Some folded and unfolded states of the NuG2 protein extracted using the second eigenfunction. (e) & (f) Contact map frequencies between different resid…
Original abstract

RaNNDy is a randomized neural network architecture for the data-driven approximation of transfer operators associated with complex dynamical systems. The weights and biases of the hidden layers of the network are randomly initialized and kept fixed, only the output layer is trained. This has several advantages over fully optimized neural networks, notably a closed-form solution for the output layer and significantly lower training costs. Despite these advantages, RaNNDy is restricted to the initial selection of weights and biases that parametrize the basis functions required for the operator approximation. Since the basis functions are determined by the activation function, choosing an appropriate activation function for the hidden layers is crucial. In this work, we propose an algorithm that optimizes the activation function itself, while keeping the weights and biases in the randomized neural network fixed, providing a more suitable dictionary. We illustrate the efficacy of the approach using various benchmark problems, including stochastic differential equations and random walks on graphons.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript builds on RaNNDy, a randomized neural network architecture for data-driven approximation of transfer operators in dynamical systems. Weights and biases in hidden layers are randomly initialized and fixed, with only the output layer trained; the new contribution is an algorithm that optimizes the activation function itself to produce a more suitable dictionary of basis functions while leaving the random parameters unchanged. Efficacy is illustrated on benchmarks including stochastic differential equations and random walks on graphons.

Significance. If the optimization of the activation function demonstrably improves the dictionary beyond what fixed random features allow, the method would offer a low-cost way to adapt randomized networks for operator approximation without full retraining or additional random features. This could be useful for high-dimensional or graph-based dynamical systems where standard random dictionaries are insufficient.

major comments (2)
  1. [Abstract, §3] Abstract and the description of the proposed algorithm: the claim that optimizing the activation function overcomes the restriction imposed by the initial random weights and biases is not supported. Basis functions remain of the form σ(w_i · x + b_i) with fixed random w_i, b_i; varying σ only warps the nonlinearity along those fixed projections and cannot recover directions missed by the random feature map. The manuscript must supply either a theoretical argument showing how the optimized σ expands the spanned space or a concrete numerical test (e.g., a low-dimensional example where the initial random dictionary fails but the optimized-σ version succeeds).
  2. [§4] §4 (numerical experiments): the reported improvements on SDE and graphon benchmarks lack quantitative comparison to the baseline RaNNDy with standard activations, error bounds, or ablation on the number of random features. Without these, it is impossible to determine whether the activation optimization compensates for poor random projections or merely refines an already adequate dictionary.
minor comments (2)
  1. [§3] Notation for the optimized activation function should be introduced explicitly with an equation rather than described only in prose.
  2. [Abstract, §4] The abstract states efficacy on benchmarks but the main text should include a table summarizing quantitative metrics (e.g., approximation error, runtime) across methods.
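
The geometric point behind major comment 1 can be made concrete with a small constructed example, which is not an experiment from the paper. The sketch assumes a deliberately poor draw in which every w_i has zero weight on the second coordinate, a target that depends only on that coordinate, and three standard activations chosen for illustration; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
m, N = 2000, 30
X = rng.normal(size=(m, 2))
target = X[:, 1]  # depends only on the second coordinate

# Deliberately poor draw (assumption): every w_i ignores x_2.
W_blind = np.vstack([rng.normal(size=(1, N)), np.zeros((1, N))])
# Generic draw for comparison: both coordinates are seen.
W_full = rng.normal(size=(2, N))
b = rng.normal(size=(1, N))

activations = {
    "tanh": np.tanh,
    "relu": lambda z: np.maximum(z, 0.0),
    "sin": np.sin,
}


def rel_residual(W, sigma):
    """Relative least-squares residual when fitting the target from the dictionary."""
    Psi = np.hstack([np.ones((m, 1)), sigma(X @ W + b)])
    coef, *_ = np.linalg.lstsq(Psi, target, rcond=None)
    return np.linalg.norm(Psi @ coef - target) / np.linalg.norm(target)


results = {name: (rel_residual(W_blind, s), rel_residual(W_full, s))
           for name, s in activations.items()}
for name, (blind, full) in results.items():
    print(f"{name:5s} blind draw: {blind:.3f}   generic draw: {full:.3f}")

# Shifting x_2 leaves every blind feature unchanged, whatever sigma is.
X_shifted = X + np.array([0.0, 5.0])
invariant = all(np.allclose(s(X @ W_blind + b), s(X_shifted @ W_blind + b))
                for s in activations.values())
print("blind features invariant under shifts of x_2:", invariant)
```

Because the blind features are functions of x_1 alone, no choice of σ can lower the residual for this target; any improvement from activation tuning has to come from reshaping the basis along directions the random draw already covers.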

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, clarifying the scope of our contribution and committing to revisions where the manuscript requires strengthening.

  1. Referee: [Abstract, §3] Abstract and the description of the proposed algorithm: the claim that optimizing the activation function overcomes the restriction imposed by the initial random weights and biases is not supported. Basis functions remain of the form σ(w_i · x + b_i) with fixed random w_i, b_i; varying σ only warps the nonlinearity along those fixed projections and cannot recover directions missed by the random feature map. The manuscript must supply either a theoretical argument showing how the optimized σ expands the spanned space or a concrete numerical test (e.g., a low-dimensional example where the initial random dictionary fails but the optimized-σ version succeeds).

    Authors: We agree that the current wording in the abstract and §3 can be read as implying that activation optimization expands the linear span beyond the random projections, which is not the case. The method improves the dictionary by selecting a nonlinearity that better matches the target operator along the fixed random directions w_i, b_i; it does not recover missed directions. We will revise the abstract and §3 to state explicitly that the optimization mitigates the restriction on the choice of activation function for a given random feature map, without claiming to overcome limitations of the random weights themselves. We will also add a low-dimensional numerical example (e.g., a 2-D linear SDE) that isolates the effect of σ optimization within a deliberately poor random dictionary and quantifies the resulting improvement in operator approximation error. revision: yes

  2. Referee: [§4] §4 (numerical experiments): the reported improvements on SDE and graphon benchmarks lack quantitative comparison to the baseline RaNNDy with standard activations, error bounds, or ablation on the number of random features. Without these, it is impossible to determine whether the activation optimization compensates for poor random projections or merely refines an already adequate dictionary.

    Authors: We acknowledge that the current §4 presents results primarily for the optimized activation without systematic side-by-side metrics against fixed standard activations (e.g., ReLU, tanh), without reported error bounds, and without ablation on the number of random features N. We will revise §4 to include: (i) direct quantitative comparisons (L2 operator error, eigenvalue errors) between optimized-σ RaNNDy and standard-activation RaNNDy on the same random seeds; (ii) error bounds or confidence intervals derived from multiple random initializations; and (iii) ablation plots showing approximation error versus N for both optimized and baseline activations on the SDE and graphon examples. These additions will clarify whether the gains arise from compensating for suboptimal projections or from refining an already sufficient dictionary. revision: yes
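
One possible shape for the promised ablation is sketched below, as an assumption-laden harness rather than the authors' protocol. It uses an Ornstein-Uhlenbeck test problem with a known second Koopman eigenvalue exp(-τ), shares seeds between the two activation settings, and reports a normal-approximation 95% interval over ten seeds. The value ω = 0.5 is a placeholder for an optimized parameter, not a number from the paper.

```python
import numpy as np

tau, m = 0.5, 2000
exact = np.exp(-tau)  # second Koopman eigenvalue of the OU test problem


def second_eig_error(N, seed, omega):
    """Absolute error of the second-largest eigenvalue magnitude for one random draw."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(m, 1))
    Y = np.exp(-tau) * X + np.sqrt(1.0 - np.exp(-2.0 * tau)) * rng.normal(size=(m, 1))
    W, b = rng.normal(size=(1, N)), rng.normal(size=(1, N))

    def feats(Z):
        return np.hstack([np.ones((len(Z), 1)), np.tanh(omega * (Z @ W + b))])

    K, *_ = np.linalg.lstsq(feats(X), feats(Y), rcond=None)
    lam = np.sort(np.abs(np.linalg.eigvals(K)))[::-1]
    return abs(lam[1] - exact)


sizes, seeds = [5, 10, 20, 40], range(10)
# Same seeds for both settings, so data and random weights are shared.
settings = {"baseline (omega = 1)": 1.0, "placeholder (omega = 0.5)": 0.5}

table = {}
for label, omega in settings.items():
    errs = np.array([[second_eig_error(N, s, omega) for s in seeds] for N in sizes])
    mean = errs.mean(axis=1)
    # Half-width of a normal-approximation 95% interval over seeds.
    half = 1.96 * errs.std(axis=1, ddof=1) / np.sqrt(errs.shape[1])
    table[label] = (mean, half)
    for i, N in enumerate(sizes):
        print(f"{label:26s} N = {N:2d}   error = {mean[i]:.4f} +/- {half[i]:.4f}")
```

With ill-conditioned dictionaries the sorted eigenvalue list can contain spurious entries, so a real study would match eigenvalues to eigenfunctions rather than rely on ordering alone.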

Circularity Check

0 steps flagged

No significant circularity detected


The paper presents RaNNDy as a randomized NN architecture with fixed random weights/biases and an algorithm to optimize the activation function for a better dictionary in transfer operator approximation. No equations, self-citations, or claims in the provided text reduce any result to fitted parameters, self-definitions, or prior author work by construction. The method is described as an independent algorithmic proposal without load-bearing reductions to inputs. This is the expected self-contained case.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that activation function optimization can produce a superior dictionary while preserving the fixed-weight advantages of randomized networks; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Optimizing the activation function overcomes the restriction imposed by fixed random weights and biases in the basis for transfer operator approximation.
    Invoked in the abstract when stating that the approach provides a more suitable dictionary.

pith-pipeline@v0.9.0 · 5679 in / 1204 out tokens · 19824 ms · 2026-05-25T05:07:20.497368+00:00 · methodology

