
arxiv: 2605.29013 · v1 · pith:MQ5DYJBFnew · submitted 2026-05-27 · 📡 eess.SY · cs.SY

Local Observability and Moving Horizon Estimation-based Training of Feedforward Neural Networks

Pith reviewed 2026-06-29 10:19 UTC · model grok-4.3

classification 📡 eess.SY cs.SY
keywords feedforward neural networks · moving horizon estimation · local observability · ReLU activation · persistently exciting inputs · weight training · dynamical systems · convergence guarantees

The pith

For two-layer ReLU networks with fixed output weights, a sufficient condition makes the weight-state dynamical system locally observable and supplies convergence guarantees for moving-horizon training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reformulates a feedforward neural network with ReLU activations as a dynamical system whose state vector consists exactly of the network weights. It then derives a sufficient condition under which the observability rank condition holds for two-layer networks whose output-layer weights remain fixed, establishing local observability of that state. This local observability supplies convergence guarantees for a moving-horizon estimation training procedure that updates only the projection of the weights onto the observable subspace from a fixed-length window of input-output data. The same analysis shows that multi-layer networks generally fail the rank condition. The reformulation therefore supplies a control-theoretic route to training with explicit guarantees rather than relying solely on empirical optimization.

Core claim

By treating the weights of an FNN as the state of a discrete-time dynamical system, the authors obtain a sufficient condition under which the observability rank condition holds for two-layer networks with fixed output weights. The resulting local observability allows construction of persistently exciting inputs that render the state distinguishable from its neighbors and, in turn, guarantees convergence of an MHE-based training algorithm that updates only the observable component of the state using a sliding window of input-output pairs.
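
The sliding-window update described above can be sketched in a few lines. This is a minimal illustration, not the authors' algorithm: it assumes a single-output network without bias, it replaces the MHE optimization problem by one pseudo-inverse (Gauss-Newton) step per window with no prior weighting, and the names `predict`, `jacobian_row`, and `mhe_step` as well as all numerical values are hypothetical. Because the pseudo-inverse returns the minimum-norm correction, each step changes the estimate only within the row space of the window Jacobian, which serves here as a stand-in for the observable subspace.

```python
import numpy as np

def predict(x, v, u, m, n):
    """Output y = sum_j v[j] * relu(w_j @ u), with x = vec(W) stacked column by column."""
    W = x.reshape((m, n), order="F")
    return v @ np.maximum(W.T @ u, 0.0)

def jacobian_row(x, v, u, m, n):
    """d y / d x at input u; valid only away from the ReLU kinks."""
    W = x.reshape((m, n), order="F")
    active = (W.T @ u > 0).astype(float)   # ReLU derivative per hidden node
    return np.kron(v * active, u)

def mhe_step(x, v, window, m, n):
    """One Gauss-Newton step on the window residuals (assumed simplification of MHE)."""
    J = np.vstack([jacobian_row(x, v, u, m, n) for u, _ in window])
    r = np.array([y - predict(x, v, u, m, n) for u, y in window])
    # Minimum-norm correction: it lies in the row space of J.
    return x + np.linalg.pinv(J) @ r

# Hypothetical toy problem: m = 2 inputs, n = 2 hidden nodes, window length N = 4.
m, n, N = 2, 2, 4
v = np.array([1.0, 1.0])                            # fixed output weights
W_true = np.array([[1.0, -1.0], [0.5, 1.0]])        # column j feeds hidden node j
x_true = W_true.flatten(order="F")
x_hat = x_true + np.array([0.1, -0.1, -0.1, 0.1])   # initial guess near the truth

base = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
        np.array([-1.0, 0.0]), np.array([1.0, -1.0])]
stream = [(u, predict(x_true, v, u, m, n)) for u in base * 2]

for k in range(len(stream)):
    window = stream[max(0, k - N + 1): k + 1]       # last N input-output pairs
    x_hat = mhe_step(x_hat, v, window, m, n)

error = np.linalg.norm(x_hat - x_true)
```

In this toy setup the estimate shares the activation pattern of the true weights on every window input, so the output is linear in the weights and the estimation error vanishes (up to rounding) once the window contains four linearly independent rows. A full MHE scheme would additionally carry a prior term and account for noise.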

What carries the argument

Reformulation of the FNN as a state-space dynamical system whose state is the vector of weights, analyzed via the observability rank condition.
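
Written out, the reformulation takes roughly the following shape. This is a sketch in the notation used in the circularity audit on this page ($W_1$, $W_2$, $b$, $u(k)$); the stacking operator $\operatorname{vec}$ and the choice of which parameters enter the state $x$ are assumptions made for illustration, not the paper's exact definitions.

```latex
\begin{aligned}
x(k+1) &= x(k), \qquad x := \operatorname{vec}(W_1)
  && \text{(trainable weights, constant in time)} \\
y(k)   &= W_2 \,\operatorname{ReLU}\!\bigl(W_1 u(k) + b\bigr)
  && \text{(output weights } W_2 \text{ fixed)}
\end{aligned}
```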

If this is right

  • Multi-layer FNNs in general fail to satisfy the observability rank condition.
  • A persistently exciting input design renders the weight state distinguishable from neighbors.
  • MHE training updates only the projection of the state onto the observable subspace.
  • Convergence guarantees hold for the resulting MHE-based training when the observability condition is met.
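
The first point, that multi-layer FNNs generally fail the rank condition, has a well-known mechanism behind it: ReLU is positively homogeneous, relu(αz) = α·relu(z) for α > 0. Rescaling the weights into a hidden node by α and the weights out of it by 1/α leaves the input-output map unchanged, so those weight states cannot be told apart from output data. The sketch below illustrates this symmetry. It is offered as a plausible illustration, not as the paper's proof, and the function name `fnn`, the layer sizes, and the scaling factors are assumptions.

```python
import numpy as np

def fnn(W1, W2, u):
    """Bias-free network y = W2 @ relu(W1 @ u), one row of W1 per hidden node."""
    return W2 @ np.maximum(W1 @ u, 0.0)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))          # 2 inputs, 3 hidden nodes
W2 = rng.normal(size=(1, 3))          # trainable output layer

# Rescale each hidden node: incoming weights by alpha, outgoing by 1/alpha.
alpha = np.array([2.0, 0.5, 3.0])     # any positive factors work
W1_scaled = alpha[:, None] * W1
W2_scaled = W2 / alpha[None, :]

# Both weight sets produce the same output for every tested input.
inputs = rng.normal(size=(50, 2))
max_gap = max(np.abs(fnn(W1, W2, u) - fnn(W1_scaled, W2_scaled, u)).max()
              for u in inputs)
```

Fixing the output weights removes this particular symmetry, which is consistent with the restriction to fixed output weights in the two-layer result.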

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same observability analysis could be attempted for activation functions other than ReLU if an analogous state-space model can be written.
  • The approach suggests that training procedures for networks embedded in feedback loops might inherit stability properties from the underlying estimator.
  • Extension to time-varying output weights would require a different state definition and new rank conditions.

Load-bearing premise

The assumption that the feedforward network can be exactly represented as a dynamical system whose state vector is the weight vector and that chosen inputs render that state distinguishable from neighbors.

What would settle it

Numerical computation of the observability matrix for a two-layer network satisfying the derived sufficient condition; if its rank falls below the dimension of the observable subspace, or if MHE training on persistently exciting data fails to recover the target weights, the claim is refuted.
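
The rank check described here can be prototyped directly. The sketch below assumes a single-output, bias-free network y = Σ_j v_j·ReLU(w_jᵀu) with fixed output weights v, and it takes the observability matrix to be the stacked Jacobian of the outputs over an input window with respect to vec(W); the paper's own definition may differ. The function names and all numerical values are hypothetical.

```python
import numpy as np

def output_jacobian(W, v, u):
    """Row vector d y / d vec(W) for y = sum_j v[j] * relu(W[:, j] @ u).

    W has shape (m, n): column j holds the weights into hidden node j.
    Valid only where no pre-activation W[:, j] @ u is exactly zero.
    """
    active = (W.T @ u > 0).astype(float)   # ReLU derivative per hidden node
    # d y / d W[i, j] = v[j] * active[j] * u[i], flattened column by column.
    return np.kron(v * active, u)

def observability_matrix(W, v, inputs):
    """Stack the output Jacobians over a window of inputs."""
    return np.vstack([output_jacobian(W, v, u) for u in inputs])

# Hypothetical example: m = 2 inputs, n = 2 hidden nodes.
W = np.array([[1.0, -1.0],
              [0.5,  1.0]])
v = np.array([1.0, 1.0])

# Inputs that switch different hidden nodes on and off.
rich_inputs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
               np.array([-1.0, 0.0]), np.array([1.0, -1.0])]
# Inputs along a single direction with a single activation pattern.
poor_inputs = [np.array([1.0, 0.0]), np.array([2.0, 0.0]),
               np.array([3.0, 0.0]), np.array([0.5, 0.0])]

rank_rich = np.linalg.matrix_rank(observability_matrix(W, v, rich_inputs))
rank_poor = np.linalg.matrix_rank(observability_matrix(W, v, poor_inputs))
```

For these values the first input set yields full rank (4, the number of hidden-layer weights), while the second yields rank 1, which illustrates why the input sequence has to be designed rather than taken for granted.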

Figures

Figures reproduced from arXiv: 2605.29013 by Matthias A. Müller, Victor G. Lopez, Yi Yang.

Figure 1: Architecture of FNNs. Consider L-layer fully connected FNNs as shown in…
Figure 2: A specific class of two-layer FNNs. … single output. Let w_{i,j} denote the weight from the i-th node in the input layer to the j-th node in the hidden layer, and let w_j := [w_{1,j}, w_{2,j}, …, w_{m,j}]^⊤, j ∈ Z_{[1,n]}, denote the weights from the whole input layer to the j-th node of the hidden layer. Then the weights from the input layer to the hidden layer are denoted by W := [w_1, w_2, …, w_n] ∈ R^{m×n}, as shown in…
Figure 3: Comparison of loss between MHE-based training method and the…
Figure 4: The error between the ideal weights and the estimated weights.
Original abstract

In this paper, we propose a moving horizon estimation (MHE)-based training method for feedforward neural networks (FNNs) with rectified linear unit (ReLU) activation functions to determine their ideal weights from a control-theoretic perspective. This allows for a rigorous theoretical analysis of the trained network. First, we reformulate the FNN as a dynamical system with the weights as states. Then, we investigate the local observability of such a system. For two-layer FNNs with fixed output weights, we derive a sufficient condition under which the observability rank condition holds, ensuring a locally observable state. We also show that multi-layer FNNs in general fail to satisfy the observability rank condition. Based on this analysis, we develop a persistently exciting (PE) input design method, which renders a state distinguishable from its neighbors. The resulting local observability provides convergence guarantees for the proposed MHE-based training, where only the projection of the state onto the observable subspace is updated using a fixed-length window of input-output data. The effectiveness of the approach is illustrated via numerical examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reformulates two-layer ReLU feedforward neural networks as discrete-time dynamical systems whose state is the vector of weights. It derives a sufficient condition under which the observability rank condition holds when the output weights are fixed, which ensures local observability, and shows that multi-layer FNNs generally fail this condition. It then designs persistently exciting inputs that make states distinguishable and uses the resulting local observability to give convergence guarantees for an MHE-based training procedure that updates only the projection of the state onto the observable subspace, using finite windows of input-output data.

Significance. If the observability analysis and convergence claims hold, the work supplies a control-theoretic foundation with explicit guarantees for training a restricted class of ReLU networks, which is a substantive contribution at the intersection of nonlinear systems theory and neural network training. The explicit PE input design and the negative result for multi-layer networks are useful technical contributions.

major comments (2)
  1. [observability analysis for two-layer FNNs (section deriving the sufficient condition)] The central claim that a sufficient condition exists under which the observability rank condition holds (abstract and the paragraph on two-layer FNNs) relies on the standard nonlinear observability rank condition, which requires the output map to be at least C^1. The output equation is y(k) = W2 · ReLU(W1 u(k) + b); ReLU is only piecewise differentiable and its Jacobian is undefined wherever any hidden pre-activation is exactly zero. The manuscript must specify whether the rank condition is evaluated using Clarke subdifferentials, one-sided derivatives, or by restricting to open sets away from the kink set, and whether the derived sufficient condition remains valid on a positive-measure set of states and inputs.
  2. [MHE training and convergence guarantees section] The convergence guarantees for the MHE-based training rest on local observability of the weight-state system. If the rank condition derivation does not rigorously handle the non-differentiable points, the local observability claim (and therefore the MHE convergence statement) is not established for generic data; the paper should provide an explicit statement of the set of (u, x) on which the guarantees apply.
minor comments (1)
  1. [abstract and multi-layer discussion] The abstract states that multi-layer FNNs 'in general fail to satisfy the observability rank condition'; a brief remark on whether this holds only for the standard Lie-derivative formulation or also for generalized notions would clarify the scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the differentiability assumptions underlying the observability analysis. We address each major comment below and will revise the manuscript to improve technical precision.

Point-by-point responses
  1. Referee: [observability analysis for two-layer FNNs] The central claim that a sufficient condition exists under which the observability rank condition holds relies on the standard nonlinear observability rank condition, which requires the output map to be at least C^1. The output equation is y(k) = W2 · ReLU(W1 u(k) + b); ReLU is only piecewise differentiable and its Jacobian is undefined wherever any hidden pre-activation is exactly zero. The manuscript must specify whether the rank condition is evaluated using Clarke subdifferentials, one-sided derivatives, or by restricting to open sets away from the kink set, and whether the derived sufficient condition remains valid on a positive-measure set of states and inputs.

    Authors: We agree that the standard observability rank condition assumes a C^1 map. Our derivation applies the rank condition only on open sets where all hidden pre-activations are nonzero, rendering ReLU locally affine (hence C^∞) and the output map differentiable. The sufficient condition is stated for generic weights and inputs that place the trajectory in such regions. These open sets have positive Lebesgue measure. We will revise Section 3 to explicitly note that the analysis is restricted to the complement of the kink set and that the condition holds on a positive-measure set. revision: yes

  2. Referee: [MHE training and convergence guarantees section] The convergence guarantees for the MHE-based training rest on local observability of the weight-state system. If the rank condition derivation does not rigorously handle the non-differentiable points, the local observability claim (and therefore the MHE convergence statement) is not established for generic data; the paper should provide an explicit statement of the set of (u, x) on which the guarantees apply.

    Authors: We concur that an explicit characterization is required. The local observability (and therefore the MHE convergence guarantees) holds on the open set of (u, x) pairs for which no hidden pre-activation is zero, i.e., the complement of the kink set. This set is open and dense for generic data. We will add a precise statement of this domain in the MHE convergence theorem and the associated discussion. revision: yes
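
The domain described in these responses can be checked numerically for a given window of data. The following sketch assumes the convention y = W2·ReLU(W1 u + b) with one row of W1 per hidden node; `kink_margin` and `safe_radius` are hypothetical helper names. A positive margin means no hidden pre-activation is zero on the window, and by the Cauchy-Schwarz inequality any perturbation of W1 whose rows have norm below the returned radius (with b unchanged) keeps every activation pattern fixed.

```python
import numpy as np

def kink_margin(W1, b, inputs):
    """Smallest |pre-activation| over all window inputs and hidden nodes."""
    pre = inputs @ W1.T + b              # shape (N, hidden)
    return np.abs(pre).min()

def safe_radius(W1, b, inputs):
    """Row-norm bound on perturbations of W1 that keeps all ReLU patterns fixed.

    Assumes at least one input in the window is nonzero.
    """
    return kink_margin(W1, b, inputs) / np.linalg.norm(inputs, axis=1).max()

# Hypothetical weights and input window.
W1 = np.array([[1.0, 0.5], [-1.0, 1.0]])
b = np.zeros(2)
inputs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, -1.0]])

margin = kink_margin(W1, b, inputs)
radius = safe_radius(W1, b, inputs)

# A random perturbation with row norms at 90% of the radius keeps the patterns.
rng = np.random.default_rng(0)
D = rng.normal(size=W1.shape)
D *= 0.9 * radius / np.linalg.norm(D, axis=1, keepdims=True)
same_pattern = np.array_equal(inputs @ W1.T + b > 0,
                              inputs @ (W1 + D).T + b > 0)

# An input that puts the second hidden node exactly on its kink.
on_kink = kink_margin(W1, b, np.array([[1.0, 1.0]]))
```

A margin of zero, as for the last input, flags a data point at which the differentiability assumption behind the rank condition does not hold.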

Circularity Check

0 steps flagged

No circularity; derivation uses standard nonlinear observability rank condition

full rationale

The paper reformulates the two-layer ReLU FNN as the autonomous system x(k+1)=x(k), y(k)=W2·ReLU(W1 u(k)+b) with weights as state, then derives a sufficient condition under which the observability rank condition holds. This step invokes the standard Lie-derivative or Jacobian-based rank test from nonlinear systems theory (external to the paper) and does not reduce any claimed guarantee to a fitted parameter, self-citation chain, or redefinition of the target quantity. The subsequent PE input design and MHE convergence statement follow directly from that rank condition without circular closure. No load-bearing step matches any enumerated circularity pattern.
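
For the single-output case with one row $w_j^\top$ of $W_1$ per hidden node and fixed output weights $W_2 = [v_1, \dots, v_n]$, the Jacobian-based test mentioned here can be written out as follows. This is a sketch under those assumptions, plus the assumption that only $W_1$ is treated as state, and it is valid only where no pre-activation vanishes; the paper's exact observability map may be defined differently.

```latex
\frac{\partial y(k)}{\partial w_j}
  = v_j \,\mathbf{1}\!\left\{ w_j^\top u(k) + b_j > 0 \right\} u(k)^\top,
\qquad
\mathcal{O}_N(x) =
\begin{bmatrix}
\partial y(0)/\partial x \\ \vdots \\ \partial y(N-1)/\partial x
\end{bmatrix},
\qquad
\operatorname{rank} \mathcal{O}_N(x) = \dim x .
```

Because $x(k+1) = x(k)$, the state transition contributes only identity factors, so the rank test reduces to the rank of these stacked output gradients, which depends entirely on the input sequence $u(0), \dots, u(N-1)$.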

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of the dynamical-system reformulation and on the existence of a sufficient condition for the observability rank condition; these are domain assumptions drawn from control theory rather than new postulates.

axioms (2)
  • domain assumption: A feedforward neural network with ReLU activations can be exactly recast as a discrete-time dynamical system whose state vector contains the network weights.
    This reformulation is the prerequisite for applying observability analysis; stated in the first paragraph of the abstract.
  • standard math: The observability rank condition is a valid test for local observability of the resulting nonlinear state-space model.
    Standard result from nonlinear control theory invoked without proof in the abstract.



Yi Yang received both the B.Eng. and the M.Sc. degrees in control science and engineering from Beijing Institute of Technology, China, in 2021 and 2024, respectively. He is currently w...