Regret-Optimal Control for Finite-State Systems

Oron Sabag; Yishay Polatov

arxiv: 2604.23760 · v1 · submitted 2026-04-26 · 🧮 math.OC

Regret-Optimal Control for Finite-State Systems

Yishay Polatov , Oron Sabag This is my paper

Pith reviewed 2026-05-08 05:44 UTC · model grok-4.3

classification 🧮 math.OC

keywords regret-optimal controldynamic regretfinite-state systemsdynamic programmingRegret-Bellman operatorlookahead benchmarkcausal policiesworst-case regret

0 comments

The pith

A nested dynamic program using the fixed point of the Regret-Bellman operator computes the optimal worst-case dynamic regret and policy for finite-state systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops causal control policies for finite-state systems that minimize worst-case dynamic regret relative to a lookahead benchmark controller. Dynamic regret evaluates performance by comparing the policy directly to the benchmark on each disturbance sequence, giving the benchmark an advantage on easy sequences while limiting exposure on hard ones. This avoids both the i.i.d. disturbance assumption of Markov decision processes and the pure worst-case focus of robust control. The solution proceeds by computing the fixed point of a newly defined Regret-Bellman operator to obtain a value function, then feeding that function into a finite-horizon dynamic program. If correct, the result supplies an exact computational procedure that produces policies interpolating between distribution-dependent and fully robust designs without requiring knowledge of disturbance statistics.

Core claim

For finite-state systems driven by exogenous disturbances, the optimal worst-case dynamic regret and a regret-optimal policy are obtained by a nested dynamic-programming procedure. The procedure first solves for the fixed point of the Regret-Bellman operator to produce a value function that encodes the benchmark-relative regret, then uses this value function inside a finite-horizon dynamic program to extract the optimal policy.

What carries the argument

The Regret-Bellman operator, a dynamic-programming operator whose fixed-point value function encodes worst-case regret against the lookahead benchmark and serves as input to the outer finite-horizon program.

If this is right

Regret-optimal policies can be computed exactly for any finite-state system without prior knowledge of the disturbance distribution.
Such policies interpolate between the performance of MDP-based controllers and robust controllers.
They can outperform both MDP-based and robust policies when disturbances are i.i.d. or possess exploitable structure.
The framework supplies an alternative design method to both classical MDP formulations and worst-case robust control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The fixed-point computation may admit efficient approximation schemes when the state space grows large but remains finite.
The same regret operator could be used to derive regret bounds in online learning settings where the disturbance process is chosen adversarially.
Extending the operator to partial-observation or infinite-horizon discounted settings would require only minor changes to the nesting structure.

Load-bearing premise

The system has finitely many states and the benchmark is a lookahead controller whose performance can be compared to the causal policy via dynamic regret.

What would settle it

A finite-state system and disturbance model for which the value returned by the nested dynamic program differs from the true minimal worst-case dynamic regret achievable by any causal policy, or for which the Regret-Bellman operator has no fixed point.

Figures

Figures reproduced from arXiv: 2604.23760 by Oron Sabag, Yishay Polatov.

**Figure 1.** Figure 1: Performance comparison of causal controllers for the inventory management. view at source ↗

**Figure 2.** Figure 2: Reward over time: λlow = 4, λhigh = 7 view at source ↗

**Figure 3.** Figure 3: Reward over time: λlow = 8, λhigh = 11 view at source ↗

**Figure 4.** Figure 4: Reward over time: λlow = 16, λhigh = 19. First, recall that the regret, for fixed policies and disturbance, is defined as the discounted returns’ difference. We can then derive the identity V (s0, πL, w) − V (s0, πC, w) = X∞ t=0 γ t (r L t − r C t ) = − k X−1 t=0 γ t r C t + X∞ t=0 γ t (r L t − γ k r C t+k ) = − k X−1 t=0 γ t r C t + X∞ t=0 γ t ρk+t. (25) To show (24), we note that a maximization over πL i… view at source ↗

read the original abstract

We study the control of finite-state systems driven by exogenous disturbances, and design causal policies that track the performance of a lookahead benchmark controller. This objective is formalized through dynamic regret, so that favorable disturbance sequences are compared against a strong benchmark, while under adverse disturbance sequences the comparison accounts for the benchmark's degraded performance. This benchmark-relative framework provides an alternative to classical MDP formulations, which assume i.i.d. disturbances, and to robust control approaches, which optimize against worst-case disturbances. Our main result is a nested dynamic-programming solution that computes both the optimal worst-case regret and a regret-optimal policy. In particular, we introduce the Regret-Bellman operator, whose fixed-point value function feeds into a finite-horizon dynamic program. Numerical examples show that regret-optimal policies interpolate nicely between MDP-based and robust controllers without requiring knowledge of the disturbance distribution, and can even outperform both under i.i.d. or structured disturbances.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a Regret-Bellman operator and nested DP to compute regret-optimal policies for finite-state systems, but the fixed-point details look underspecified.

read the letter

The main new piece is the Regret-Bellman operator whose fixed point supplies a value function for a finite-horizon DP that yields both the optimal dynamic regret and the corresponding causal policy. This sits between standard MDP formulations that need i.i.d. disturbances and robust control that optimizes purely against the worst case, by instead comparing against a lookahead benchmark via dynamic regret. The numerical examples are useful here: they show the resulting policies interpolate between the two classical approaches and can even beat both when disturbances are i.i.d. or have structure, without needing the distribution in advance. That framing and the concrete examples are the parts that land cleanly. The soft spot is exactly the one the stress-test flags. The abstract states that the operator has a fixed point that then feeds the DP, but gives no conditions, contraction argument, or monotonicity result to guarantee existence or uniqueness. Without those, the nested construction is not obviously well-defined, and the subsequent DP step could rest on shaky ground. The full paper may contain the missing proof, but nothing in the provided material shows it. The rest of the setup looks standard and the citation pattern is appropriate for the subfield. This is for control theorists already working on regret or adaptive methods for uncertain finite-state systems; a reader looking for a new DP-style tool in that niche would find the operator and the interpolation behavior worth seeing. It is worth sending to peer review so the fixed-point claim can be checked properly rather than desk-rejected on the abstract alone.

Referee Report

1 major / 1 minor

Summary. The paper studies regret-optimal control for finite-state systems subject to exogenous disturbances. It formalizes the objective via dynamic regret against a lookahead benchmark controller and claims a nested dynamic-programming solution: a Regret-Bellman operator is introduced whose fixed-point value function is then used as input to a finite-horizon dynamic program that yields both the optimal worst-case regret and the regret-optimal policy. Numerical examples are said to show that the resulting policies interpolate between standard MDP and robust controllers.

Significance. If the fixed-point construction is rigorously justified, the approach supplies a computationally tractable, distribution-free alternative to classical MDP and H-infinity-style robust control for finite-state systems. The ability to obtain policies that perform competitively under both i.i.d. and structured disturbances without prior knowledge of the disturbance law would be a useful addition to the control literature.

major comments (1)

[Main result (Regret-Bellman operator and nested DP)] The central claim rests on the Regret-Bellman operator admitting a fixed point that can be fed into the subsequent finite-horizon DP. No contraction mapping, monotonicity, or other sufficient conditions guaranteeing existence (let alone uniqueness) of this fixed point are stated or proved in the abstract or the high-level description of the main result. Without such conditions the nested construction is not known to be well-defined.

minor comments (1)

[Numerical examples] The numerical examples are described only at a high level; concrete system dimensions, disturbance realizations, and quantitative regret values relative to the MDP and robust baselines would help readers assess the practical advantage.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. We address the major comment below and will revise the manuscript to improve clarity on the well-definedness of the nested construction.

read point-by-point responses

Referee: [Main result (Regret-Bellman operator and nested DP)] The central claim rests on the Regret-Bellman operator admitting a fixed point that can be fed into the subsequent finite-horizon DP. No contraction mapping, monotonicity, or other sufficient conditions guaranteeing existence (let alone uniqueness) of this fixed point are stated or proved in the abstract or the high-level description of the main result. Without such conditions the nested construction is not known to be well-defined.

Authors: We agree that the existence (and uniqueness) of the fixed point must be rigorously justified for the nested dynamic program to be well-defined. The full manuscript establishes this via a contraction-mapping argument: under the maintained assumption of bounded exogenous disturbances, the Regret-Bellman operator is shown to be a contraction with respect to the supremum norm, so that the Banach fixed-point theorem directly yields existence and uniqueness of the fixed-point value function. This justification appears in the technical development of the main result. However, we acknowledge that the abstract and the high-level overview do not explicitly reference the contraction property or the theorem invoked. We will revise both the abstract and the statement of the main result to include a concise mention of these conditions and the resulting guarantee, thereby making the well-definedness of the construction transparent at the high level. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper's core construction defines a Regret-Bellman operator directly from the dynamic-regret objective, the finite-state dynamics, and the lookahead benchmark controller. The fixed-point value function is then inserted into a standard finite-horizon dynamic program. This follows the conventional pattern of value iteration for MDPs or robust control problems, where the operator is assembled from the problem primitives rather than presupposing its own solution. No equation reduces the claimed optimal regret or policy to a fitted parameter, a self-citation, or a renamed input; the finite-state assumption supplies the necessary compactness for the DP to be well-posed in principle, and the abstract presents the result as a computational procedure rather than a tautology. The derivation therefore remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard dynamic programming assumptions for finite-state systems and the existence of a lookahead benchmark; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Finite-state system with exogenous disturbances admits a dynamic programming solution via Regret-Bellman operator
Invoked in the main result description for computing optimal regret and policy.

pith-pipeline@v0.9.0 · 5452 in / 1166 out tokens · 27516 ms · 2026-05-08T05:44:49.894221+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1]

Model predictive control: Theory and practice—a survey,

C. E. Garcia, D. M. Prett, and M. Morari, “Model predictive control: Theory and practice—a survey,”Automatica, vol. 25, no. 3, pp. 335–348, 1989

work page 1989
[2]

A survey of industrial model predictive control technology,

S. J. Qin and T. A. Badgwell, “A survey of industrial model predictive control technology,”Control engineering practice, vol. 11, no. 7, pp. 733–764, 2003

work page 2003
[3]

On the value of preview information for safety control,

Z. Liu and N. Ozay, “On the value of preview information for safety control,” in2021 American Control Conference (ACC). IEEE, 2021, pp. 2348–2354

work page 2021
[4]

Bounded-regret mpc via perturbation analysis: Prediction error, constraints, and nonlinearity,

Y . Lin, Y . Hu, G. Qu, T. Li, and A. Wierman, “Bounded-regret mpc via perturbation analysis: Prediction error, constraints, and nonlinearity,” Advances in Neural Information Processing Systems, vol. 35, pp. 36 174–36 187, 2022

work page 2022
[5]

The power of predictions in online control,

C. Yu, G. Shi, S.-J. Chung, Y . Yue, and A. Wierman, “The power of predictions in online control,”Advances in Neural Information Processing Systems, vol. 33, pp. 1994–2004, 2020

work page 1994
[6]

The value of reward lookahead in reinforcement learning,

N. Merlis, D. Baudry, and V . Perchet, “The value of reward lookahead in reinforcement learning,”Advances in Neural Information Processing Systems, vol. 37, pp. 83 627–83 664, 2024

work page 2024
[7]

Reinforcement learning with lookahead information,

N. Merlis, “Reinforcement learning with lookahead information,”Advances in Neural Information Processing Systems, vol. 37, pp. 64 523– 64 581, 2024

work page 2024
[8]

Online markov decision processes,

E. Even-Dar, S. M. Kakade, and Y . Mansour, “Online markov decision processes,”Mathematics of Operations Research, vol. 34, no. 3, pp. 726–736, 2009

work page 2009
[9]

The nonstochastic control problem,

E. Hazan, S. Kakade, and K. Singh, “The nonstochastic control problem,” inAlgorithmic Learning Theory. PMLR, 2020, pp. 408–421

work page 2020
[10]

Dynamic regret of policy optimization in non-stationary environments,

Y . Fei, Z. Yang, Z. Wang, and Q. Xie, “Dynamic regret of policy optimization in non-stationary environments,”Advances in Neural Information Processing Systems, vol. 33, pp. 6743–6754, 2020

work page 2020
[11]

Dynamic regret of online markov decision processes,

P. Zhao, L.-F. Li, and Z.-H. Zhou, “Dynamic regret of online markov decision processes,” inInternational Conference on Machine Learning. PMLR, 2022, pp. 26 865–26 894

work page 2022
[12]

Regret-optimal controller for the full-information problem,

O. Sabag, G. Goel, S. Lale, and B. Hassibi, “Regret-optimal controller for the full-information problem,” in2021 American Control Conference (ACC). IEEE, 2021, pp. 4777–4782

work page 2021
[13]

Regret-optimal estimation and control,

G. Goel and B. Hassibi, “Regret-optimal estimation and control,”IEEE Transactions on Automatic Control, vol. 68, no. 5, pp. 3041–3053, 2023

work page 2023
[14]

A system level approach to regret optimal control,

A. Didier, J. Sieber, and M. N. Zeilinger, “A system level approach to regret optimal control,”IEEE Control Systems Letters, vol. 6, pp. 2792–2797, 2022

work page 2022
[15]

Regret optimal control for uncertain stochastic systems,

A. Martin, L. Furieri, F. D ¨orfler, J. Lygeros, and G. Ferrari-Trecate, “Regret optimal control for uncertain stochastic systems,”European Journal of Control, vol. 80, p. 101051, 2024

work page 2024
[16]

Regret-optimal control under partial observability,

J. Hajar, O. Sabag, and B. Hassibi, “Regret-optimal control under partial observability,” in2024 American Control Conference (ACC), 2024, pp. 4072–4077

work page 2024
[17]

What you should know about approximate dynamic programming,

W. B. Powell, “What you should know about approximate dynamic programming,”Naval Research Logistics (NRL), vol. 56, no. 3, pp. 239–249, 2009

work page 2009
[18]

Strong functional representation lemma and applications to coding theorems,

C. T. Li and A. El Gamal, “Strong functional representation lemma and applications to coding theorems,”IEEE Transactions on Information Theory, vol. 64, no. 11, pp. 6967–6978, 2018

work page 2018
[19]

Old and new methods for lost-sales inventory systems,

P. Zipkin, “Old and new methods for lost-sales inventory systems,”Operations research, vol. 56, no. 5, pp. 1256–1263, 2008

work page 2008
[20]

Robust dynamic programming,

G. N. Iyengar, “Robust dynamic programming,”Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005

work page 2005
[21]

Robust control of markov decision processes with uncertain transition matrices,

A. Nilim and L. El Ghaoui, “Robust control of markov decision processes with uncertain transition matrices,”Operations Research, vol. 53, no. 5, pp. 780–798, 2005

work page 2005
[22]

M. L. Puterman,Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014

[1] [1]

Model predictive control: Theory and practice—a survey,

C. E. Garcia, D. M. Prett, and M. Morari, “Model predictive control: Theory and practice—a survey,”Automatica, vol. 25, no. 3, pp. 335–348, 1989

work page 1989

[2] [2]

A survey of industrial model predictive control technology,

S. J. Qin and T. A. Badgwell, “A survey of industrial model predictive control technology,”Control engineering practice, vol. 11, no. 7, pp. 733–764, 2003

work page 2003

[3] [3]

On the value of preview information for safety control,

Z. Liu and N. Ozay, “On the value of preview information for safety control,” in2021 American Control Conference (ACC). IEEE, 2021, pp. 2348–2354

work page 2021

[4] [4]

Bounded-regret mpc via perturbation analysis: Prediction error, constraints, and nonlinearity,

Y . Lin, Y . Hu, G. Qu, T. Li, and A. Wierman, “Bounded-regret mpc via perturbation analysis: Prediction error, constraints, and nonlinearity,” Advances in Neural Information Processing Systems, vol. 35, pp. 36 174–36 187, 2022

work page 2022

[5] [5]

The power of predictions in online control,

C. Yu, G. Shi, S.-J. Chung, Y . Yue, and A. Wierman, “The power of predictions in online control,”Advances in Neural Information Processing Systems, vol. 33, pp. 1994–2004, 2020

work page 1994

[6] [6]

The value of reward lookahead in reinforcement learning,

N. Merlis, D. Baudry, and V . Perchet, “The value of reward lookahead in reinforcement learning,”Advances in Neural Information Processing Systems, vol. 37, pp. 83 627–83 664, 2024

work page 2024

[7] [7]

Reinforcement learning with lookahead information,

N. Merlis, “Reinforcement learning with lookahead information,”Advances in Neural Information Processing Systems, vol. 37, pp. 64 523– 64 581, 2024

work page 2024

[8] [8]

Online markov decision processes,

E. Even-Dar, S. M. Kakade, and Y . Mansour, “Online markov decision processes,”Mathematics of Operations Research, vol. 34, no. 3, pp. 726–736, 2009

work page 2009

[9] [9]

The nonstochastic control problem,

E. Hazan, S. Kakade, and K. Singh, “The nonstochastic control problem,” inAlgorithmic Learning Theory. PMLR, 2020, pp. 408–421

work page 2020

[10] [10]

Dynamic regret of policy optimization in non-stationary environments,

Y . Fei, Z. Yang, Z. Wang, and Q. Xie, “Dynamic regret of policy optimization in non-stationary environments,”Advances in Neural Information Processing Systems, vol. 33, pp. 6743–6754, 2020

work page 2020

[11] [11]

Dynamic regret of online markov decision processes,

P. Zhao, L.-F. Li, and Z.-H. Zhou, “Dynamic regret of online markov decision processes,” inInternational Conference on Machine Learning. PMLR, 2022, pp. 26 865–26 894

work page 2022

[12] [12]

Regret-optimal controller for the full-information problem,

O. Sabag, G. Goel, S. Lale, and B. Hassibi, “Regret-optimal controller for the full-information problem,” in2021 American Control Conference (ACC). IEEE, 2021, pp. 4777–4782

work page 2021

[13] [13]

Regret-optimal estimation and control,

G. Goel and B. Hassibi, “Regret-optimal estimation and control,”IEEE Transactions on Automatic Control, vol. 68, no. 5, pp. 3041–3053, 2023

work page 2023

[14] [14]

A system level approach to regret optimal control,

A. Didier, J. Sieber, and M. N. Zeilinger, “A system level approach to regret optimal control,”IEEE Control Systems Letters, vol. 6, pp. 2792–2797, 2022

work page 2022

[15] [15]

Regret optimal control for uncertain stochastic systems,

A. Martin, L. Furieri, F. D ¨orfler, J. Lygeros, and G. Ferrari-Trecate, “Regret optimal control for uncertain stochastic systems,”European Journal of Control, vol. 80, p. 101051, 2024

work page 2024

[16] [16]

Regret-optimal control under partial observability,

J. Hajar, O. Sabag, and B. Hassibi, “Regret-optimal control under partial observability,” in2024 American Control Conference (ACC), 2024, pp. 4072–4077

work page 2024

[17] [17]

What you should know about approximate dynamic programming,

W. B. Powell, “What you should know about approximate dynamic programming,”Naval Research Logistics (NRL), vol. 56, no. 3, pp. 239–249, 2009

work page 2009

[18] [18]

Strong functional representation lemma and applications to coding theorems,

C. T. Li and A. El Gamal, “Strong functional representation lemma and applications to coding theorems,”IEEE Transactions on Information Theory, vol. 64, no. 11, pp. 6967–6978, 2018

work page 2018

[19] [19]

Old and new methods for lost-sales inventory systems,

P. Zipkin, “Old and new methods for lost-sales inventory systems,”Operations research, vol. 56, no. 5, pp. 1256–1263, 2008

work page 2008

[20] [20]

Robust dynamic programming,

G. N. Iyengar, “Robust dynamic programming,”Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005

work page 2005

[21] [21]

Robust control of markov decision processes with uncertain transition matrices,

A. Nilim and L. El Ghaoui, “Robust control of markov decision processes with uncertain transition matrices,”Operations Research, vol. 53, no. 5, pp. 780–798, 2005

work page 2005

[22] [22]

M. L. Puterman,Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014