Goal-oriented safe active learning for predictive control using Bayesian recurrent neural networks

Alessio La Bella; Anna Scampicchio; Johannes Köhler; Laura Boca de Giuli; Manish Prajapat; Melanie Zeilinger; Riccardo Scattolini

arxiv: 2604.12542 · v1 · submitted 2026-04-14 · 📡 eess.SY · cs.SY

Laura Boca de Giuli, Alessio La Bella, Manish Prajapat, Johannes Köhler, Anna Scampicchio, Riccardo Scattolini, Melanie Zeilinger

Pith reviewed 2026-05-10 15:01 UTC · model grok-4.3

keywords: model predictive control · safe active learning · Bayesian recurrent neural networks · online model adaptation · recursive feasibility · safety guarantees · economic performance · exploration termination

The pith

A model predictive control (MPC) framework using Bayesian recurrent neural networks adapts its model online through goal-oriented safe active learning while achieving economic performance comparable to that of an MPC with full system knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper embeds recursive Bayesian updates of the last-layer parameters of a recurrent neural network inside a model predictive control loop. A goal-oriented safe active learning procedure alternates between an exploration phase that gathers informative data while still optimizing the main objective and a pure goal-reaching phase. Theoretical results establish recursive feasibility of the optimization, high-probability safety, finite-time termination of exploration, and economic performance comparable to an MPC that already knows the full system model. A reader cares because learning-based control must otherwise trade off data collection against the risk of unsafe or costly behavior during adaptation.

Core claim

By recursively updating the last-layer parameters of a Bayesian recurrent neural network within an MPC framework, the proposed goal-oriented safe active learning algorithm alternates between an exploration phase, where the controller seeks informative data while still optimizing the main objective, and a goal-reaching phase focused solely on the objective. This yields recursive feasibility, probabilistic safety, finite-time termination of exploration, and economic performance comparable to an MPC with complete system knowledge, as shown in benchmark energy system simulations.
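
As an illustration of the recursive last-layer update, the following is a minimal sketch that treats the output layer as Bayesian linear regression on fixed features. It is not the authors' implementation: the function names, the Gaussian prior, and the `noise_var` parameter are assumptions made for this example, and the feature vector stands in for whatever the frozen recurrent layers produce.

```python
import numpy as np


def bayes_last_layer_update(theta_mean, precision, features, y, noise_var=1.0):
    """One recursive Bayesian linear-regression step on fixed features.

    theta_mean : (n,) current posterior mean of the last-layer weights
    precision  : (n, n) current posterior precision (inverse covariance)
    features   : (n,) feature vector from the frozen hidden layers
    y          : scalar measured output
    noise_var  : assumed measurement-noise variance (hypothetical value)
    """
    x = np.asarray(features, dtype=float)
    # Each new sample adds a rank-one term to the precision matrix.
    new_precision = precision + np.outer(x, x) / noise_var
    # Combine the previous information vector with the new measurement.
    rhs = precision @ theta_mean + x * y / noise_var
    new_mean = np.linalg.solve(new_precision, rhs)
    return new_mean, new_precision


def predictive_std(precision, features):
    """Posterior standard deviation of the prediction theta^T x."""
    x = np.asarray(features, dtype=float)
    return float(np.sqrt(x @ np.linalg.solve(precision, x)))
```

Because each step only adds a positive semidefinite term to the precision matrix, the predictive standard deviation at any fixed feature vector cannot increase as data arrive. An argument for finite-time termination of exploration would plausibly build on a property of this kind.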

What carries the argument

The goal-oriented safe active learning algorithm that alternates between exploration and goal-reaching phases using Bayesian posterior updates on RNN parameters to quantify uncertainty and enforce safety.

If this is right

Model accuracy improves progressively without penalizing the primary control objective.
Safety constraints are respected with high probability at every time step.
Exploration terminates after a finite number of steps, after which the controller focuses only on the goal.
The underlying optimization remains recursively feasible throughout operation.
Economic performance approaches that achieved by an MPC with perfect prior knowledge of the system.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The alternation strategy could apply to other uncertain nonlinear systems where online data collection must not compromise operational limits.
It suggests that Bayesian uncertainty estimates can be used directly inside the MPC cost or constraints to limit conservatism while still guaranteeing safety.
Real hardware deployment would test whether the finite-time termination property holds under model mismatch or sensor noise beyond the benchmark simulations.
The same structure might be used with other probabilistic models besides recurrent neural networks provided their uncertainty can be updated recursively.

Load-bearing premise

The recurrent neural network must be expressive enough to capture the relevant system dynamics and the Bayesian uncertainty estimates must be accurate enough to support safe yet informative exploration decisions.

What would settle it

Closed-loop simulations or experiments in which safety constraints are violated with non-negligible probability, exploration fails to terminate in bounded time, or economic performance remains substantially worse than the full-knowledge MPC would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.12542 by Alessio La Bella, Anna Scampicchio, Johannes Köhler, Laura Boca de Giuli, Manish Prajapat, Melanie Zeilinger, Riccardo Scattolini.

**Figure 1.** Schematic representation of lower (green) and upper (purple) bounds, and pessimistic (blue) and optimistic (grey) sets.

**Figure 2.** Schematic representation of the AROMA DHS, showing the heating station, the supply and return pipelines, and the five thermal loads, together with...

**Figure 3.** Simulation results of the omniscient MPC. (a) Electricity price. (b) Optimised control input (blue) and corresponding constraints (black). (c) Load...

**Figure 4.** Simulation results of the proposed learning-based MPC. (a) Cost difference...

Original abstract

A key challenge in learning-based model predictive control (MPC) is to collect informative data online for model adaptation while ensuring safety and without penalising control performance. In this paper, we propose an online model adaptation scheme embedded within an MPC framework in which the last-layer parameters of a recurrent neural network are recursively updated via Bayesian learning. This is achieved by means of a goal-oriented safe active learning algorithm that alternates between an exploration phase, where the MPC actively explores system dynamics to collect informative data for model adaptation while still pursuing the main control objective, and a goal-reaching phase, where it focuses exclusively on the main control objective. The algorithm is complemented with theoretical guarantees of (i) recursive feasibility, (ii) safety, (iii) termination of exploration in finite time, and (iv) close-to-optimal performance. Simulation results on a benchmark energy system demonstrate that the proposed framework achieves economic performance comparable to that of an MPC with full system knowledge, while progressively improving model accuracy and respecting operational safety constraints with high probability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main advance is a two-phase goal-oriented active learning scheme inside MPC that uses Bayesian last-layer RNN updates and claims four explicit guarantees including finite-time exploration termination.

The core new piece is the alternation between an exploration phase that still chases the economic objective while probing dynamics and a pure goal-reaching phase, all wrapped in an MPC that updates only the output layer of a pre-trained RNN via Bayesian recursion. This produces a concrete algorithm with stated guarantees on recursive feasibility, high-probability safety, finite termination of exploration, and near-optimal closed-loop performance. The energy-system benchmark simulation shows economic cost close to the full-knowledge MPC case while model accuracy improves and constraints hold with high probability. That combination of structure and claims is not routine in the safe MPC literature they cite.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes an online model adaptation scheme for learning-based MPC using Bayesian RNNs, where only the last-layer parameters are recursively updated. It introduces a goal-oriented safe active learning algorithm that alternates between exploration (to collect informative data while pursuing the control objective) and goal-reaching phases. Theoretical guarantees are provided for recursive feasibility, safety with high probability, finite-time termination of exploration, and close-to-optimal performance. Simulations on a benchmark energy system show economic performance comparable to full-knowledge MPC while respecting safety constraints.

Significance. If the central assumptions hold, the work offers a concrete advance in safe learning-based control by embedding active learning directly in MPC with finite-time guarantees and demonstrated benchmark performance. The explicit handling of exploration termination and the use of Bayesian updates for uncertainty are strengths that could influence practical implementations in energy and process control. However, the significance is tempered by the reliance on fixed hidden features in the RNN.

major comments (3)

[theoretical analysis and safety proof] The safety and recursive feasibility guarantees (detailed in the theoretical analysis) rest on the Bayesian posterior over last-layer weights producing valid high-probability uncertainty sets that contain the true dynamics at every step. Because only the output layer is adapted while hidden-state features remain fixed after initial training, any persistent mismatch between these features and the true nonlinear dynamics can cause the posterior to underestimate uncertainty, violating the safety invariant. This assumption is load-bearing for all four claimed guarantees.
[active learning algorithm and termination proof] The finite-time termination of exploration and close-to-optimal performance claims depend on the exploration threshold parameters, which are free parameters in the algorithm. The manuscript does not provide a parameter-free derivation or explicit bounds showing how these thresholds can be chosen independently of the system to guarantee termination without excessive conservatism or performance loss.
[simulation results] The simulation results claim performance comparable to an MPC with full system knowledge. However, the experimental setup lacks explicit rules for data exclusion, initial training of the RNN feature extractor, and quantification of the probability levels used in the safety constraints, making it difficult to assess whether the results generalize beyond the specific benchmark energy system.

minor comments (3)

[methods] Notation for the Bayesian posterior update and the RNN hidden-state recursion should be introduced earlier and used consistently to improve readability.
[figures] Figure captions in the simulation section would benefit from including the exact probability thresholds and number of Monte Carlo runs used to generate the reported trajectories.
[introduction] A brief discussion of related work on safe exploration in MPC and Bayesian neural network control would help situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our manuscript. We address each of the major comments in detail below, providing clarifications and indicating the revisions we plan to implement to improve the paper.

Referee: The safety and recursive feasibility guarantees (detailed in the theoretical analysis) rest on the Bayesian posterior over last-layer weights producing valid high-probability uncertainty sets that contain the true dynamics at every step. Because only the output layer is adapted while hidden-state features remain fixed after initial training, any persistent mismatch between these features and the true nonlinear dynamics can cause the posterior to underestimate uncertainty, violating the safety invariant. This assumption is load-bearing for all four claimed guarantees.

Authors: We appreciate this insightful observation regarding the foundational assumptions of our theoretical analysis. Our approach relies on the initial offline training of the RNN to learn a sufficiently expressive feature representation of the system dynamics, after which the Bayesian update on the last-layer weights provides high-probability bounds conditional on these features. While we acknowledge that a significant persistent mismatch could lead the posterior to underestimate uncertainty, the theoretical guarantees are derived under the assumption that the true dynamics are well-approximated within the feature space plus the quantified uncertainty. To address this, we will revise the manuscript to explicitly state this assumption, discuss conditions for its validity (e.g., through cross-validation of the feature extractor), and note potential limitations along with mitigation strategies such as periodic feature retraining in long-term deployments. This will strengthen the presentation without altering the core results.

revision: partial

Referee: The finite-time termination of exploration and close-to-optimal performance claims depend on the exploration threshold parameters, which are free parameters in the algorithm. The manuscript does not provide a parameter-free derivation or explicit bounds showing how these thresholds can be chosen independently of the system to guarantee termination without excessive conservatism or performance loss.

Authors: The exploration thresholds are indeed tunable parameters that balance the exploration-exploitation trade-off in our goal-oriented active learning scheme. While a fully parameter-free guarantee would require stronger assumptions on the system (such as known Lipschitz constants or information gain bounds independent of the model), we can derive explicit relationships between the thresholds, the termination time, and the performance gap. Specifically, the termination proof relies on the cumulative information gain exceeding a threshold related to the desired optimality gap. We will update the theoretical section to include these explicit bounds and provide practical guidelines for selecting the thresholds based on the desired safety probability and performance tolerance, which can be computed from system-specific quantities like the maximum uncertainty reduction per step. This revision will make the parameter selection more transparent and less conservative.

revision: yes

Referee: The simulation results claim performance comparable to an MPC with full system knowledge. However, the experimental setup lacks explicit rules for data exclusion, initial training of the RNN feature extractor, and quantification of the probability levels used in the safety constraints, making it difficult to assess whether the results generalize beyond the specific benchmark energy system.

Authors: We agree that additional details are necessary for reproducibility and to evaluate generalizability. In the revised manuscript, we will expand the simulation section to specify: (i) the initial training procedure, including the dataset size, training epochs, and validation method for the RNN feature extractor; (ii) data handling rules, where all collected data points are incorporated into the Bayesian update without exclusion, but with a forgetting factor for older data if applicable; and (iii) the exact probability levels used (e.g., 95% for the uncertainty sets in safety constraints, as per the high-probability guarantees). These additions will allow readers to better replicate and assess the results on other systems.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's claimed guarantees of recursive feasibility, safety with high probability, finite-time exploration termination, and near-optimal performance are derived from standard MPC shifting arguments for recursive feasibility and from the maintained assumption that Bayesian posterior uncertainty sets contain the true dynamics with the stated probability. These steps rely on external properties of Bayesian learning and MPC theory rather than reducing any prediction or result to a quantity defined by the paper's own fitted parameters or self-citations. The RNN last-layer adaptation and goal-oriented active learning scheme are presented as algorithmic choices whose performance is validated in simulation, without the central claims being forced by construction from the inputs. No load-bearing self-citation chain or self-definitional reduction is exhibited.
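
The shifting argument mentioned above is commonly set up by building a candidate input sequence for the next time step from the previous optimal one. The sketch below shows only that construction. The name `shifted_candidate` and the argument `terminal_input` are hypothetical: the appended input would come from some terminal control law, and whether the candidate is actually feasible depends on terminal ingredients that this page does not describe.

```python
import numpy as np


def shifted_candidate(prev_inputs, terminal_input):
    """Build a candidate input sequence for time k+1 from the solution at k.

    prev_inputs    : (N, m) input sequence computed at time k
    terminal_input : (m,) input appended at the end of the horizon
                     (hypothetical, e.g. from a terminal controller)
    """
    prev_inputs = np.asarray(prev_inputs, dtype=float)
    tail = np.asarray(terminal_input, dtype=float).reshape(1, -1)
    # Drop the input that was already applied and append the terminal one.
    return np.concatenate([prev_inputs[1:], tail], axis=0)


# Horizon of three steps with a single input channel.
candidate = shifted_candidate([[1.0], [2.0], [3.0]], [0.0])
```

If the candidate satisfies the constraints of the next problem, that problem has at least one feasible point, which is the core of a recursive feasibility proof by induction.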

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions from MPC and Bayesian learning rather than new free parameters or invented entities; limited information is available from the abstract alone.

free parameters (1)

exploration threshold parameters
Parameters that decide when to switch between exploration and goal-reaching phases; these are chosen or tuned for the specific application.
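
To see why a threshold on uncertainty can lead to finite termination, consider a deliberately simplified scalar sketch. It assumes, hypothetically, that every exploration step adds a fixed amount `excitation ** 2` to a scalar posterior precision and that exploration stops once the posterior standard deviation is no larger than `threshold`. Neither this rule nor the names come from the paper.

```python
import math


def exploration_steps(threshold, prior_precision=1.0, excitation=1.0,
                      max_steps=10_000):
    """Count exploration steps until the posterior std reaches the threshold.

    All arguments are illustrative assumptions, not the paper's parameters.
    """
    precision = prior_precision
    steps = 0
    # Keep exploring while the posterior std is still above the threshold.
    while 1.0 / math.sqrt(precision) > threshold and steps < max_steps:
        precision += excitation ** 2  # one informative sample
        steps += 1
    return steps
```

In this toy model the precision grows linearly with the number of samples, so the loop ends for any positive threshold; a smaller threshold only lengthens exploration. This is the trade-off that makes the threshold a tuning parameter.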

axioms (2)

domain assumption: The RNN model can approximate the true system dynamics sufficiently well for MPC predictions.
Required for both the control law and the active learning decisions to be meaningful.
domain assumption: Bayesian posterior updates on the last layer provide reliable uncertainty bounds for safety constraints.
Central to guaranteeing safety during the exploration phase.
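
One way such uncertainty bounds could enter a safety constraint is through a confidence interval of the form "posterior mean plus or minus a scaled standard deviation". The sketch below is a simplified illustration under that assumption. The function names, the scaling factor `beta`, and the interval test are hypothetical and do not reproduce the paper's pessimistic-set construction.

```python
import numpy as np


def confidence_interval(theta_mean, covariance, x, beta):
    """Lower and upper bounds on theta^T x from a Gaussian posterior."""
    x = np.asarray(x, dtype=float)
    mean = float(theta_mean @ x)
    sigma = float(np.sqrt(x @ covariance @ x))
    return mean - beta * sigma, mean + beta * sigma


def is_pessimistically_safe(theta_mean, covariance, x, beta, y_min, y_max):
    """True if the whole confidence interval lies inside [y_min, y_max]."""
    lower, upper = confidence_interval(theta_mean, covariance, x, beta)
    return y_min <= lower and upper <= y_max
```

Requiring the full interval to lie inside the output limits is conservative by design: as the posterior covariance shrinks, more inputs pass the test. The check is only as trustworthy as the uncertainty estimate itself, which is why this assumption is listed as an axiom.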

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1] Anderson, J. A. (1995), An introduction to neural networks, MIT Press. Boca de Giuli, L., La Bella, A., De Nicolao, G. and Scattolini, R. (2024), Lifelong learning for monitoring and adaptation of data-based dynamical models: a statistical process control approach, in ‘2024 European Control Conference (ECC)’, IEEE, pp. 947–952. Boca de Giuli, L., La Bella, A...

[2] Mesbah, A. (2018), ‘Stochastic model predictive control with active uncertainty learning: A survey on dual control’, Annual Reviews in Control 45, 107–117. Morari, M., Garcia, C. E. and Prett, D. M. (1988), ‘Model predictive control: Theory and practice’, IFAC Proceedings Volumes 21(4), 1–12. Morato, M. M. and Felix, M. S. (2024), ‘Data science and model pre...

[3] A. Proof of Lemma 1. The proof draws inspiration from (Lew et al., 2022, Theorem 1). In detail, we need to prove that, with probability at least $1-\delta$ and for all $k \in \mathbb{N}_{>0}$: $|\theta^{\star\top} x - \bar{\theta}_k^{\top} x| \le \beta_k \Sigma_k(x)$. (24) To do so, from (5c) we obtain the update for $\bar{\theta}_k$: $\bar{\theta}_k = \Lambda_k^{-1}(x_k \tilde{y}^{\star}_k + x_{k-1} \tilde{y}^{\star}_{k-1} + \dots + \Lambda_0 \bar{\theta}_0)$. (25) Now we define the auxiliary matrices storing measure...

[4] and recursive feasibility is proved for the three optimisation problems. □ (2) Safety. Thanks to (12d) and (13d), the control actions $u^{e}_{0:h^{\star}-1|k}$ from problem (12) and $u^{p}_{0|k}$ from (13), respectively, ensure that $\bar{\theta}_k^{\top} x \in \mathcal{Y} = [y_{\min}, y_{\max}]$ for all $x \in \mathcal{X}^{p}_k$. By Corollary 1, it holds that $\theta^{\star\top} x \in \mathcal{Y}$ as well with probability at least $1-\delta$. □ (3) Finite termination of e...
