Self-Optimizing Control of Continuous Processes Based on Reinforcement Learning

Hongye Su; Junghui Chen; Lei Xie; Ziqi Zhuo

arxiv: 2606.04471 · v1 · pith:CGW7RX3Rnew · submitted 2026-06-03 · 📡 eess.SY · cs.SY

Ziqi Zhuo, Junghui Chen, Lei Xie, Hongye Su

Pith reviewed 2026-06-28 05:23 UTC · model grok-4.3

classification 📡 eess.SY cs.SY

keywords self-optimizing control · reinforcement learning · continuous stirred-tank reactor · actor network · dynamic performance · online fine-tuning · model mismatch

The pith

The self-optimizing control structure is embedded in a reinforcement-learning actor network, so controlled variables are optimized through environment interaction rather than through explicit constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an RL-based method for self-optimizing control in continuous industrial processes to handle high-frequency disturbances better than steady-state data approaches. The controlled variable structure is placed inside the actor network while rewards are defined from economic performance, allowing the agent to learn suitable variables by direct interaction. This setup is said to implicitly respect implementability and steady-state uniqueness, avoiding the need for added regularization or constraints. Online fine-tuning is added to handle model mismatch. Experiments on a continuous stirred-tank reactor show smoother outputs, stronger disturbance rejection, simpler tuning, and easier adaptation compared with the baseline method.

Core claim

Embedding the self-optimizing control (SOC) controlled-variable structure inside the actor network of a reinforcement-learning agent, together with economic-indicator reward functions, enables the agent to discover and optimize controlled variables through environment interaction while implicitly accounting for implementability and steady-state uniqueness; online fine-tuning then compensates for model mismatch and yields improved dynamic performance under real-time disturbances.

What carries the argument

Actor network that directly incorporates the SOC controlled-variable selection structure, trained by policy gradients on economic rewards.
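As a rough illustration of what "the controlled-variable structure inside the actor" could look like, the sketch below computes a candidate controlled variable c = f(θ, y) from the measurements y and squashes it into a valve opening in [0, 1], the action range quoted in the figure excerpts. The linear form of `controlled_variable`, the logistic squashing, and the `gain` parameter are all assumptions made for this example; the summary does not say how the paper couples f(θ, y) to the action.

```python
import math
from typing import Sequence


def controlled_variable(theta: Sequence[float], y: Sequence[float]) -> float:
    """Candidate controlled variable c = f(theta, y).

    A linear combination is assumed here purely for illustration.
    """
    if len(theta) != len(y):
        raise ValueError("theta and y must have the same length")
    return sum(t * v for t, v in zip(theta, y))


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def actor_mean_action(theta: Sequence[float], y: Sequence[float],
                      gain: float = 1.0) -> float:
    """Map the controlled variable to a valve opening u in [0, 1].

    The squashing function and the gain are hypothetical design choices.
    """
    return _sigmoid(gain * controlled_variable(theta, y))


# Hypothetical two-measurement example.
u = actor_mean_action(theta=[0.5, -0.25], y=[2.0, 4.0])
```

Because θ sits inside the policy, any gradient-based RL update of the policy also updates the controlled-variable definition, which is the sense in which the structure is "learned rather than hand-designed".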

If this is right

Dynamic performance improves under real-time disturbances relative to steady-state data methods.
Controlled-variable trajectories remain smooth without added regularization penalties.
Hyperparameter tuning effort decreases because the structure is learned rather than hand-designed.
Online fine-tuning restores performance when the process model changes.
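The last point can be pictured with a deliberately simple toy: if the optimum of the real plant sits close to the optimum found offline, continuing from the offline parameters should need fewer updates than training from scratch. The sketch below uses plain gradient descent on a scalar quadratic loss; the loss, the numbers 1.0 and 1.2, and the function name are hypothetical and stand in for the RL training loop, which is not described in this summary.

```python
def steps_to_converge(theta0: float, theta_star: float, lr: float = 0.1,
                      tol: float = 1e-2, max_steps: int = 10_000) -> int:
    """Gradient descent on J(theta) = (theta - theta_star)**2.

    Returns the number of updates needed before |theta - theta_star| < tol.
    """
    theta = theta0
    for step in range(max_steps):
        if abs(theta - theta_star) < tol:
            return step
        theta -= lr * 2.0 * (theta - theta_star)  # dJ/dtheta = 2 (theta - theta_star)
    return max_steps


offline_optimum = 1.0   # hypothetical optimum under the nominal model
plant_optimum = 1.2     # hypothetical optimum under model mismatch

warm_start = steps_to_converge(offline_optimum, plant_optimum)  # fine-tuning
cold_start = steps_to_converge(0.0, plant_optimum)              # direct training
```

In this toy the warm start always wins because it begins closer to the new optimum; whether the same holds for the actual actor network depends on how far the mismatch moves the optimum.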

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same embedding could allow control redesign when operating conditions shift without restarting from a full steady-state optimization.
Removing explicit constraints may reduce the engineering effort needed when moving to processes with many candidate controlled variables.
The approach suggests a route toward fully model-free SOC when combined with further online adaptation.

Load-bearing premise

Interaction with the environment lets the RL agent discover controlled variables that are both implementable and lead to unique steady states without any explicit constraints or regularization terms.

What would settle it

Running the same continuous stirred-tank reactor experiments and finding that the RL controller produces larger output oscillations or slower disturbance rejection than the objective-guided steady-state method would falsify the performance claim.
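To make "larger output oscillations" and "slower disturbance rejection" measurable, one option is to compare simple trajectory statistics for the two controllers. The sketch below uses total variation as an oscillation measure and integral absolute error as a rejection measure; both metric choices and the function names are assumptions, since the summary does not state which metrics the paper reports.

```python
from typing import Sequence


def total_variation(trajectory: Sequence[float]) -> float:
    """Sum of absolute step-to-step changes; larger means more oscillation."""
    return sum(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))


def integral_absolute_error(trajectory: Sequence[float], setpoint: float,
                            dt: float) -> float:
    """Rectangle-rule integral of |y - setpoint| over the trajectory."""
    return dt * sum(abs(v - setpoint) for v in trajectory)


# Hypothetical trajectories sampled every 10 s (the time step listed in Table I).
smooth = [0.0, 0.5, 1.0, 1.0]
jittery = [0.0, 1.0, 0.0, 1.0]
tv_smooth = total_variation(smooth)
tv_jittery = total_variation(jittery)
iae_smooth = integral_absolute_error(smooth, setpoint=1.0, dt=10.0)
```

Under this reading, the performance claim would be in trouble if the RL controller scored higher than the baseline on both statistics for the same disturbance sequence.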

Figures

Figures reproduced from arXiv: 2606.04471 by Hongye Su, Junghui Chen, Lei Xie, Ziqi Zhuo.

**Figure 2.** CSTR.

Consider a disturbance applied to the system:

$$T_i(t) \sim \mathcal{N}\left(\mu_{T_i} = 66.0,\ \sigma^2_{T_i} = 8\right) \tag{10}$$

where $T_i$ denotes the feed temperature. Parameters in the dynamic equations are listed in Table I.

Table I. Simulation parameters

| Parameter | Value | Unit | Description |
|---|---|---|---|
| dt | 10 | s | Simulation time step |
| V | 7.08 | m³ | Reactor volume |
| V_C | 1.82 | m³ | Jacket volume |
| F | 0.0075 | m³/s | Feed rate |
| F_Cmax | 0.02 | m³/s | Maximum flow |
| α | 50 | - | Valve rangeab… |

**Figure 3.** Steady-state performance of the Baseline.

**Figure 5.** RL-based SOC vs Baseline vs RTO under the same…

**Figure 4.** Training Reward Curve in RL (Offline).

C. Our Method: RL-based SOC. 1) Offline SOC Design: The RL agent's state space is defined as the CSTR output measurements y, and the action space corresponds to the valve opening u ∈ [0, 1]. The policy…

**Figure 7.** Online Fine-Tuning vs Direct Training.

V. Conclusion: In summary, RL-based SOC showed promising performance on the disturbed CSTR, demonstrating that RL can effectively train the nonlinear mapping f(θ, y) and improve dynamic performance under distributed disturbances. The method does not rely on highly sensitive hyperparameters and, under model mismatch, can mitigate economic losses through online finetunin…

**Figure 6.** f(θ, y): RL-based SOC vs Baseline.

RL-based SOC also exhibited smoother control than the Baseline…
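The feed-temperature disturbance quoted in the Figure 2 excerpt can be reproduced with a few lines of standard-library Python. The sketch reads the second argument of N(·, ·) as a variance, because the excerpt writes it as σ², so the standard deviation passed to the sampler is √8. Drawing one independent sample per simulation step is an assumption; the excerpt does not say how the disturbance is correlated in time. The function name and the seed argument are hypothetical.

```python
import math
import random
from typing import List

MEAN_TI = 66.0     # mean feed temperature from Eq. (10)
VAR_TI = 8.0       # variance of the feed temperature from Eq. (10)
DT_SECONDS = 10.0  # simulation time step from Table I


def sample_feed_temperature(n_steps: int, seed: int = 0) -> List[float]:
    """Draw one feed-temperature value per simulation step.

    Assumes independent Gaussian draws at every step.
    """
    rng = random.Random(seed)
    std = math.sqrt(VAR_TI)  # random.gauss expects a standard deviation
    return [rng.gauss(MEAN_TI, std) for _ in range(n_steps)]


# One simulated hour of disturbance at the 10 s step.
one_hour = sample_feed_temperature(int(3600 / DT_SECONDS))
```

Passing 8 directly as the spread would inflate the disturbance by a factor of about 2.8, which is worth checking when reproducing the experiments.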

Original abstract

This paper addresses the Self-Optimizing Control (SOC) problem in industrial continuous processes and proposes a Reinforcement-Learning (RL)-based SOC approach to improve dynamic performance under high-frequency disturbances. In the proposed framework, the SOC controlled variable structure is embedded in the Actor network, and reward functions are designed based on economic indicators. Through interaction with the environment, the RL agent optimizes controlled variables while implicitly considering implementability and steady-state uniqueness. Online fine-tuning is further introduced to alleviate model mismatch. Experiments on a continuous stirred-tank reactor with disturbances compare the proposed RL-based SOC method with the Objective-Guided Controlled Variable Learning Approach based on steady-state data. The results show that the RL method achieves improved dynamic performance under real-time disturbances, generates smooth controlled variable outputs without explicit regularization, reduces hyperparameter-tuning complexity, and enhances adaptability through online adjustment. Overall, the proposed RL-based SOC approach provides an effective solution for nonlinear process control and offers a promising reference for future studies involving multiple disturbances, multiple operating conditions, and model-free scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Embedding the SOC structure in an RL actor with economic rewards and online fine-tuning is a workable applied idea, but the claim of implicit enforcement of implementability and uniqueness has no demonstrated mechanism in the CSTR results.

The paper's concrete step is putting the SOC controlled-variable selection inside the actor network, using economic indicators for the reward, and adding online fine-tuning to handle model mismatch. The CSTR simulations then show better disturbance rejection and smoother outputs than the steady-state baseline method.

That combination is new enough as an application and gives a practical route for nonlinear process control without needing separate regularization terms. The online adjustment part is a clear engineering plus for real-time use.

The soft spot is exactly the one the stress-test flags. The abstract states that interaction alone lets the agent implicitly handle implementability and steady-state uniqueness, yet nothing in the reported experiments isolates or verifies that mechanism. The comparison is only to a steady-state method, so it is unclear whether the reported smoothness and gains come from the embedding, the reward shaping, or other unstated training details. Without ablations or explicit checks on variable uniqueness under high-frequency disturbances, that part of the argument stays unproven.
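One explicit check of the kind asked for here could look as follows. For a fixed disturbance, a controlled variable that crosses its setpoint more than once as the valve opening sweeps [0, 1] admits more than one closed-loop steady state, so counting sign changes on a grid gives a crude uniqueness diagnostic. This is only one reading of "steady-state uniqueness"; the two steady-state maps below are hypothetical stand-ins, not the CSTR model.

```python
from typing import Callable


def count_setpoint_crossings(cv_of_u: Callable[[float], float],
                             setpoint: float, n_grid: int = 201) -> int:
    """Count sign changes of c(u) - setpoint on a uniform grid over u in [0, 1].

    Exact zeros are dropped before counting, so a root that falls exactly on
    the first or last grid point is not detected.
    """
    values = [cv_of_u(i / (n_grid - 1)) - setpoint for i in range(n_grid)]
    nonzero = [v for v in values if v != 0.0]
    return sum(1 for a, b in zip(nonzero, nonzero[1:]) if a * b < 0.0)


def monotone_cv(u: float) -> float:
    """Hypothetical steady-state map with a single crossing."""
    return u - 0.333


def folded_cv(u: float) -> float:
    """Hypothetical steady-state map that folds back on itself."""
    return (u - 0.5) ** 2


unique = count_setpoint_crossings(monotone_cv, setpoint=0.0)
ambiguous = count_setpoint_crossings(folded_cv, setpoint=0.033)
```

Running such a count on the learned f(θ, y) across the disturbance range, with and without the embedding, would be one way to test the implicit-uniqueness claim directly.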

This is for process-control researchers who already work with RL and want a template for SOC-style problems. A reader looking for a ready-to-use industrial method will find the framework suggestive but will need the full methods and any extra diagnostics to judge the implicit-optimization claim.

Send it for peer review so the authors can supply the missing checks on how the implicit constraints actually operate.

Referee Report

1 major / 0 minor

Summary. The paper proposes a reinforcement learning (RL)-based framework for self-optimizing control (SOC) of continuous industrial processes. It embeds the SOC controlled-variable structure directly in the Actor network and designs rewards from economic indicators. The central claim is that interaction with the environment allows the RL agent to optimize controlled variables while implicitly enforcing implementability and steady-state uniqueness, without explicit constraints or regularization; online fine-tuning is added for model mismatch. Simulation experiments on a continuous stirred-tank reactor (CSTR) under real-time disturbances report improved dynamic performance, smoother outputs, and better adaptability compared with an objective-guided steady-state baseline.

Significance. If validated, the approach would provide a data-driven route to SOC that reduces reliance on explicit regularization and hyperparameter tuning while handling high-frequency disturbances. The embedding of SOC structure in the Actor and use of economic rewards represent a concrete integration of RL with process-control objectives, with potential reference value for model-free or multi-condition scenarios.

major comments (1)

[Abstract / Proposed framework] The load-bearing claim that the RL agent 'implicitly considers implementability and steady-state uniqueness' through environment interaction alone (without explicit constraints or regularization) is not supported by a demonstrated mechanism. Neither the reward design nor the Actor embedding is shown to prevent selection of non-unique or non-implementable controlled variables under high-frequency disturbances; the CSTR comparison to the steady-state baseline does not isolate whether reported smoothness and dynamic gains arise from this implicit capability or from unstated factors such as reward shaping or training procedure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the single major comment below.

Referee: [Abstract / Proposed framework] The load-bearing claim that the RL agent 'implicitly considers implementability and steady-state uniqueness' through environment interaction alone (without explicit constraints or regularization) is not supported by a demonstrated mechanism. Neither the reward design nor the Actor embedding is shown to prevent selection of non-unique or non-implementable controlled variables under high-frequency disturbances; the CSTR comparison to the steady-state baseline does not isolate whether reported smoothness and dynamic gains arise from this implicit capability or from unstated factors such as reward shaping or training procedure.

Authors: The reward is constructed directly from economic performance indicators of the closed-loop process. Any controlled-variable selection that is non-implementable or yields non-unique steady states necessarily produces inconsistent or suboptimal economic returns when the agent interacts with the full nonlinear dynamics; the policy gradient therefore drives the Actor away from such selections without needing an auxiliary penalty term. The Actor embedding restricts outputs to the exact linear-combination form required by SOC, so the search space itself excludes structurally invalid candidates. In the CSTR experiments the only difference between the RL agent and the steady-state baseline is the online interaction with disturbances; the observed smoothness and improved dynamic metrics therefore arise from the learned policy rather than from unstated reward shaping. We maintain that the mechanism is demonstrated by the training procedure and the empirical outcome, though we acknowledge it is empirical rather than a formal proof.

Revision: no

Circularity Check

0 steps flagged

No significant circularity in the RL-based SOC derivation

The paper proposes embedding the SOC structure in the Actor network and using economic-indicator rewards, with optimization occurring via standard RL environment interaction. This does not reduce to a self-definitional equivalence, a fitted parameter renamed as prediction, or a load-bearing self-citation chain. No uniqueness theorem or ansatz is imported from prior author work, and the method is compared to an external baseline (Objective-Guided Controlled Variable Learning Approach). The derivation chain remains independent of its own outputs and is self-contained against the described external validation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review is based solely on the abstract; the paper relies on standard reinforcement-learning assumptions for control and process-model interaction, but specific free parameters such as reward scaling or network architecture choices are not detailed.

free parameters (1)

RL reward scaling and network hyperparameters
Economic-indicator rewards and actor-critic architecture typically require tuned scaling factors and learning rates whose specific values are not reported in the abstract.
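Why reward scaling counts as a free parameter can be seen in the simplest policy-gradient update. The sketch below performs one vanilla REINFORCE step on the mean of a Gaussian policy, with the reward defined as a scaled negative cost. REINFORCE is used here as a simpler stand-in: the algorithm the paper actually uses is not stated in this summary, and the function name, the cost value, and the policy form are hypothetical.

```python
def reinforce_update(mu: float, sigma: float, action: float, cost: float,
                     reward_scale: float, lr: float) -> float:
    """One REINFORCE step for the mean of a Gaussian policy N(mu, sigma^2).

    reward = -reward_scale * cost
    d/dmu log pi(action) = (action - mu) / sigma^2
    """
    reward = -reward_scale * cost
    grad_log_pi = (action - mu) / sigma ** 2
    return mu + lr * reward * grad_log_pi


# Doubling the reward scale has the same effect as doubling the learning rate.
scaled_reward = reinforce_update(0.5, 0.1, 0.6, 2.0, reward_scale=2.0, lr=0.01)
scaled_lr = reinforce_update(0.5, 0.1, 0.6, 2.0, reward_scale=1.0, lr=0.02)
```

In this plain form the two knobs are interchangeable, so an unreported reward scale leaves the effective step size undetermined. Methods that normalize returns or advantages weaken this coupling.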

axioms (1)

Domain assumption: the continuous process can be treated as a Markov decision process suitable for standard RL algorithms.
Implicit in any RL control application; stated as the basis for the actor-critic interaction with the environment.

