The Feasibility Theory of Constrained Reinforcement Learning: A Tutorial Study

Changliu Liu; Masayoshi Tomizuka; Shengbo Eben Li; Yujie Yang; Zhilong Zheng

arxiv: 2404.10064 · v2 · submitted 2024-04-15 · 📡 eess.SY · cs.SY

Yujie Yang, Zhilong Zheng, Masayoshi Tomizuka, Changliu Liu, Shengbo Eben Li

Pith reviewed 2026-05-24 01:42 UTC · model grok-4.3

keywords: feasibility theory · constrained reinforcement learning · model predictive control · feasible regions · virtual-time domain · safety constraints · policy feasibility · feasibility function

The pith

Decoupling policy solving into virtual-time and real-time domains defines feasible regions for arbitrary reinforcement learning policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a feasibility theory for constrained reinforcement learning that applies to arbitrary policies rather than only optimal ones. It achieves this by splitting policy solving into a virtual-time domain and implementation into a real-time domain. This split permits separate definitions of initial and endless feasible regions for states and policies, along with their containment relationships that fully characterize the feasible region of any given policy. A sympathetic reader would care because reinforcement learning improves policies iteratively, so constraint satisfaction must be analyzed for every intermediate policy, not just the final one. The theory also supplies virtual-time constraint design rules and a feasibility function that enlarges the feasible region as much as possible.

Core claim

The central claim is that decoupling policy solving and implementation into virtual-time and real-time domains allows definitions of initial and endless feasible regions in both state and policy spaces. The containment relationships among these regions describe the feasible region of an arbitrary policy. The same framework applies to model predictive control and reinforcement learning. Virtual-time constraint design rules and a feasibility function are provided to achieve the maximum feasible region, and most existing constraint formulations are shown to be applications of feasibility functions in different forms.

What carries the argument

The decoupling of policy solving into a virtual-time domain and implementation into a real-time domain, which enables separate definitions of initial/endless and state/policy feasible regions whose containments describe feasibility for any policy.
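
One plausible way to write these four regions down is sketched below. The notation is assumed for illustration and is not taken from the paper: `f` stands for the system dynamics, `g` for a virtual-time constraint function, `N` for a prediction horizon, and the subscripts "init" and "edls" are hypothetical labels for the initial and endless regions.

```latex
% Assumed notation (hypothetical, for illustration only):
%   x_{i|t}  : state predicted i virtual-time steps ahead of real time t
%   x_{t+k}  : real-time state reached k steps after t, with
%              x_{t+k+1} = f(x_{t+k}, \pi(x_{t+k}))
%   g, N     : virtual-time constraint function and prediction horizon
\begin{align*}
X_{\mathrm{init}}(\pi) &= \{\, x : x_{0|t} = x,\ x_{i+1|t} = f(x_{i|t}, \pi(x_{i|t})),\ g(x_{i|t}) \le 0,\ i = 1, \dots, N \,\} \\
X_{\mathrm{edls}}(\pi) &= \{\, x : x_t = x,\ x_{t+k} \in X_{\mathrm{init}}(\pi)\ \ \forall k \ge 0 \,\} \\
X_{\mathrm{init}} &= \bigcup_{\pi} X_{\mathrm{init}}(\pi), \qquad
X_{\mathrm{edls}} = \bigcup_{\pi} X_{\mathrm{edls}}(\pi)
\end{align*}
% Containments that follow directly from the definitions above:
\begin{align*}
X_{\mathrm{edls}}(\pi) \subseteq X_{\mathrm{init}}(\pi) \subseteq X_{\mathrm{init}}, \qquad
X_{\mathrm{edls}}(\pi) \subseteq X_{\mathrm{edls}} \subseteq X_{\mathrm{init}}
\end{align*}
```

Under this reading, the policy regions are indexed by a specific policy while the state regions are unions over all policies, which is why every policy region sits inside the corresponding state region. The paper's own definitions may differ in detail.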

If this is right

The feasible region of any policy is completely characterized by the containment relations between its initial and endless state and policy feasible regions.
Virtual-time constraint design using feasibility functions can be used to achieve the largest possible feasible region.
Existing constraint formulations in the literature are essentially different realizations of feasibility functions.
Feasible regions for MPC and RL policies can be compared directly through visualization in tasks such as emergency braking.
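
To make the idea of a feasibility function concrete, here is a minimal sketch for a toy braking problem. Everything in it is an assumption for illustration rather than the paper's formulation: the state is a gap to an obstacle and a speed, `A_MAX` is a made-up deceleration limit, and the candidate function is "stopping distance minus remaining gap", with values at or below zero read as feasible.

```python
A_MAX = 8.0  # strongest available deceleration in m/s^2 (assumed)

def feasibility_value(gap, speed, a_max=A_MAX):
    """Candidate feasibility function: stopping distance minus remaining gap."""
    return speed * speed / (2.0 * a_max) - gap

def brake_step(gap, speed, dt, a_max=A_MAX):
    """Exact constant-deceleration update over one step, halting at standstill."""
    t = min(dt, speed / a_max)  # braking time actually used within this step
    new_gap = gap - speed * t + 0.5 * a_max * t * t
    new_speed = max(speed - a_max * t, 0.0)
    return new_gap, new_speed

def rollout_values(gap, speed, steps=50, dt=0.1):
    """Track the feasibility value along a full-braking rollout."""
    values = [feasibility_value(gap, speed)]
    for _ in range(steps):
        gap, speed = brake_step(gap, speed, dt)
        values.append(feasibility_value(gap, speed))
    return values, gap

values, final_gap = rollout_values(gap=30.0, speed=20.0)
print(min(values), max(values), final_gap)
```

Under full braking the value stays constant along the trajectory, so a state that starts at or below zero never crosses above it. That invariance is the property that would let the zero-sublevel set of such a function serve as a feasible region in this toy setting.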

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This approach could support RL training procedures that monitor and enlarge feasible regions at each iteration rather than waiting for optimality.
The same virtual/real-time split might be applied to feasibility analysis in other iterative optimization settings in control.
Safety monitoring during learning becomes feasible without assuming the policy has already converged to optimality.

Load-bearing premise

That separating policy solving into a virtual-time domain and implementation into a real-time domain is a valid modeling choice that captures the feasibility properties of non-optimal policies without introducing artifacts.

What would settle it

An experiment or calculation in which an arbitrary policy's observed constraint-satisfying behavior in a control task deviates from the predictions made by the containment relationships among the defined feasible regions.
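
A calculation of this kind could be sketched as follows for a toy braking task. The model is an assumption, not the paper's experiment: a forward-Euler gap/speed system, two constant-deceleration policies, and a coarse state grid. The predicted containment is that every state safe under the weaker policy is also safe under the strongest one, so any grid state violating that would count as a counterexample.

```python
DT = 0.1      # integration step in seconds (assumed)
A_MAX = 8.0   # strongest available deceleration in m/s^2 (assumed)

def stays_safe(gap, speed, decel, dt=DT):
    """Roll out constant braking; True if the gap never goes negative."""
    while speed > 0.0:
        gap -= speed * dt
        speed = max(speed - decel * dt, 0.0)
        if gap < 0.0:
            return False
    return gap >= 0.0

def safe_states(decel, gaps, speeds):
    """Grid states from which the braking policy never violates the constraint."""
    return {(g, v) for g in gaps for v in speeds if stays_safe(g, v, decel)}

gaps = [2.0 * i for i in range(26)]     # 0 to 50 m
speeds = [1.0 * j for j in range(21)]   # 0 to 20 m/s

region_full = safe_states(A_MAX, gaps, speeds)
region_half = safe_states(0.5 * A_MAX, gaps, speeds)

# States observed safe under the weaker policy but not under the stronger one
# would contradict the predicted containment.
counterexamples = region_half - region_full
print(len(region_half), len(region_full), len(counterexamples))
```

An empty `counterexamples` set is consistent with the prediction on this grid; it does not prove the containment in general, and a non-empty set would be the kind of evidence described above.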

Original abstract

Satisfying safety constraints is a priority concern when solving optimal control problems (OCPs). Due to the existence of infeasibility phenomenon, where a constraint-satisfying solution cannot be found, it is necessary to identify a feasible region before implementing a policy. Existing feasibility theories built for model predictive control (MPC) only consider the feasibility of optimal policy. However, reinforcement learning (RL), as another important control method, solves the optimal policy in an iterative manner, which comes with a series of non-optimal intermediate policies. Feasibility analysis of these non-optimal policies is also necessary for iteratively improving constraint satisfaction; but that is not available under existing MPC feasibility theories. This paper proposes a feasibility theory that applies to both MPC and RL by filling in the missing part of feasibility analysis for an arbitrary policy. The basis of our theory is to decouple policy solving and implementation into two temporal domains: virtual-time domain and real-time domain. This allows us to separately define initial and endless, state and policy feasibility, and their corresponding feasible regions. Based on these definitions, we analyze the containment relationships between different feasible regions, which enables us to describe the feasible region of an arbitrary policy. We further provide virtual-time constraint design rules along with a practical design tool called feasibility function that helps to achieve the maximum feasible region. We review most of existing constraint formulations and point out that they are essentially applications of feasibility functions in different forms. We demonstrate our feasibility theory by visualizing different feasible regions under both MPC and RL policies in an emergency braking control task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper reframes feasibility for non-optimal RL policies via a virtual/real-time split on top of MPC ideas, with clean definitions and a review of prior methods, but the advance is mostly organizational.

The core contribution is a set of definitions that let you talk about feasibility for any policy, not just the optimal one that MPC theories usually assume. By splitting policy improvement into a virtual-time domain and execution into real time, they define initial versus endless feasibility, plus state and policy feasible regions, then map the containment relations between them. That structure is what lets them describe the feasible set for an arbitrary intermediate policy. They also give design rules for virtual-time constraints and introduce a feasibility function that they claim unifies most existing constraint formulations as special cases. The emergency-braking example shows the regions visually for both MPC and RL policies, which helps make the abstractions concrete. The writing is tutorial-style and the containment diagrams are straightforward to follow. The main limitation is that the framework inherits the modeling assumptions of the virtual/real split; if that split introduces artifacts for certain RL algorithms or environments, the containment claims could be less tight than presented. The paper does not include new empirical benchmarks beyond the visualization, so it is hard to judge how much tighter the feasible regions become in practice compared with standard barrier or penalty methods. This is useful reading for people already working on constrained MPC or safe RL who want a more systematic language for intermediate policies. It is not a foundational shift, but the organization is clear enough that a referee could give targeted feedback on the definitions and the unification claim. I would send it out for review rather than desk-reject.

Referee Report

2 major / 3 minor

Summary. The manuscript develops a feasibility theory for constrained optimal control problems that applies to both model predictive control (MPC) and reinforcement learning (RL). It decouples policy solving (virtual-time domain) from implementation (real-time domain) to define initial and endless feasible regions for states and policies, analyzes their containment relationships to characterize the feasible region of arbitrary (including non-optimal) policies, introduces virtual-time constraint design rules and a feasibility function tool claimed to maximize the feasible region, reviews existing constraint formulations as instances of this function, and demonstrates the concepts via visualizations in an emergency braking task under MPC and RL policies.

Significance. If the virtual/real-time decoupling and derived containment relations hold without introducing artifacts for non-optimal policies, the work would provide a unified framework addressing the gap in feasibility analysis for iterative RL policies, extending beyond MPC's focus on optimal policies. The synthesis of existing constraint methods as feasibility functions and the tutorial-style demonstration add practical value for constraint design in safety-critical control.

major comments (2)

§3 (virtual-time domain definitions): The central decoupling of policy solving into virtual time is load-bearing for all subsequent containment claims and feasible-region descriptions, yet the manuscript provides no formal argument or counterexample analysis showing that this modeling choice preserves feasibility properties for non-optimal intermediate RL policies without artifacts (cf. the emergency-braking demonstration in §6).
§4 (containment relationships): The claim that the analyzed containments fully describe the feasible region of an arbitrary policy assumes the endless feasible region is well-defined and non-empty under the virtual-time optimal policy; this step is not shown to hold when the real-time policy is a non-converged RL iterate, which is the motivating case.

minor comments (3)

The abstract and introduction would benefit from one or two key equations or definitions to make the decoupling and feasibility-function concept concrete for readers.
Notation for the four feasible-region types (initial/endless × state/policy) is introduced without a summary table; adding one would improve readability when discussing containments.
In the emergency-braking example (§6), the figures visualize regions but lack quantitative metrics (e.g., area ratios or violation rates) to support the qualitative containment claims.
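
As a rough illustration of the metrics suggested in the last comment, the sketch below computes an area ratio between two gridded regions and a violation rate over a batch of rollouts. The function names, the representation of regions as sets of grid cells, and the sample data are all hypothetical; none of them come from the paper.

```python
def area_ratio(region, reference):
    """Fraction of the reference region's grid cells that `region` also covers."""
    if not reference:
        raise ValueError("reference region is empty")
    return len(region & reference) / len(reference)

def violation_rate(min_margins):
    """Share of rollouts whose smallest constraint margin fell below zero."""
    if not min_margins:
        raise ValueError("no rollouts given")
    return sum(1 for m in min_margins if m < 0.0) / len(min_margins)

# Hypothetical grid cells (gap index, speed index) for two regions.
reference = {(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)}
policy_region = {(1, 0), (2, 0), (2, 1)}

ratio = area_ratio(policy_region, reference)
rate = violation_rate([0.3, -0.1, 1.2, -2.0])
print(ratio, rate)
```

Reporting both numbers per training iteration would show whether a policy's region grows toward the reference region while violations fall, which is the quantitative support the comment asks for.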

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below, providing clarifications on the virtual-time decoupling and containment relations while noting where additional discussion can be incorporated.

Referee: §3 (virtual-time domain definitions): The central decoupling of policy solving into virtual time is load-bearing for all subsequent containment claims and feasible-region descriptions, yet the manuscript provides no formal argument or counterexample analysis showing that this modeling choice preserves feasibility properties for non-optimal intermediate RL policies without artifacts (cf. the emergency-braking demonstration in §6).

Authors: The virtual-time domain is defined as the iterative policy-solving process (distinct from real-time execution) to explicitly accommodate the sequence of non-optimal intermediate policies generated by RL algorithms. Feasibility properties are preserved by construction: virtual-time feasibility evaluates a policy's constraint satisfaction during the solving iterations, while real-time feasibility applies to its deployment. This separation does not introduce artifacts for non-optimal policies because the definitions of initial/endless feasible regions and their containments are policy-agnostic and derived from the same constraint sets. The emergency-braking example in §6 applies the framework directly to RL iterates, confirming consistent region descriptions. A brief clarifying paragraph on this definitional basis can be added to §3.

Revision: partial

Referee: §4 (containment relationships): The claim that the analyzed containments fully describe the feasible region of an arbitrary policy assumes the endless feasible region is well-defined and non-empty under the virtual-time optimal policy; this step is not shown to hold when the real-time policy is a non-converged RL iterate, which is the motivating case.

Authors: The endless feasible region is defined relative to the virtual-time optimal policy as the reference maximum region (consistent with standard MPC feasibility assumptions). For an arbitrary real-time policy (including non-converged RL iterates), the containment relations in §4 characterize its feasible region as a subset without requiring the real-time policy itself to achieve the endless region. This is the core contribution for handling iterative RL policies. The motivating case is covered because the relations apply to any policy by definition, independent of convergence status. No additional assumption is needed beyond the existence of a virtual-time optimum, which is standard in the OCP setup.

Revision: no

Circularity Check

0 steps flagged

No significant circularity identified

The paper introduces a modeling choice (virtual-time vs. real-time decoupling) and then defines new concepts (initial/endless feasibility, state/policy feasible regions) and analyzes containment relations within that framework. This is a standard definitional construction for extending MPC feasibility ideas to arbitrary RL policies; the 'description' of feasible regions follows directly from the introduced definitions rather than reducing any claim to an input by construction. No equations, fitted parameters, or self-citations appear as load-bearing steps in the abstract or stated claims. Existing constraint methods are reinterpreted as instances of the new feasibility function, but this is presented as unification rather than renaming without independent content. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract alone, the central claim rests on the modeling choice of decoupling into virtual-time and real-time domains and on the validity of the resulting containment relationships; no numerical free parameters, ad-hoc axioms, or invented physical entities are mentioned.

axioms (1)

Domain assumption: Decoupling policy solving and implementation into separate virtual-time and real-time domains is a valid modeling choice that preserves feasibility properties for non-optimal policies.
This separation is the basis for defining initial/endless and state/policy feasible regions and their containment relationships.

pith-pipeline@v0.9.0 · 5823 in / 1379 out tokens · 20539 ms · 2026-05-24T01:42:53.427001+00:00 · methodology
