On the Peril of (Even a Little) Nonstationarity in Satisficing Regret Minimization
Pith reviewed 2026-05-15 09:02 UTC · model grok-4.3
The pith
In realizable piecewise-stationary bandits with L at least 2 segments, the best satisficing regret grows as Theta of L log T.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the general realizable, piecewise-stationary setting with L stationary segments, the optimal satisficing regret is Theta(L log T) whenever L is at least 2, whereas the stationary case L equals 1 permits a T-independent constant regret under the same realizability assumption.
What carries the argument
A Fano-based lower-bound framework that constructs a post-interaction reference distribution to handle the nonstationary piecewise-stationary case.
If this is right
- Satisficing regret must grow logarithmically with time whenever the number of stationary segments is at least two.
- Any algorithm that aims for bounded satisficing regret must fail on some realizable instance once a single change point appears.
- The contrast with the L equals 1 case shows that even minimal nonstationarity qualitatively changes the achievable guarantee.
- A special regime identified in the paper still permits constant regret when additional structure restricts the possible changes.
Where Pith is reading between the lines
- The result suggests that change-point detection or adaptation carries an unavoidable logarithmic cost in the satisficing metric even when the model is realizable.
- It would be natural to ask whether known change locations or restricted families of segment distributions restore constant regret outside the special regime already noted.
Load-bearing premise
The environment is realizable and changes exactly L times with L at least 2, and a post-interaction reference distribution exists that lets the Fano argument go through.
What would settle it
An algorithm that achieves o(L log T) satisficing regret on every realizable piecewise-stationary instance with two segments would falsify the claimed lower bound.
read the original abstract
Motivated by the principle of satisficing in decision-making, we study satisficing regret guarantees for nonstationary $K$-armed bandits. We show that in the general realizable, piecewise-stationary setting with $L$ stationary segments, the optimal regret is $\Theta(L\log T)$ as long as $L\geq 2$. This stands in sharp contrast to the case of $L=1$ (i.e., the stationary setting), where a $T$-independent $\Theta(1)$ satisficing regret is achievable under realizability. In other words, the optimal regret has to scale with $T$ even if just a little nonstationarity presents. A key ingredient in our analysis is a novel Fano-based framework tailored to nonstationary bandits via a \emph{post-interaction reference} construction. This framework strictly extends the classical Fano method for passive estimation as well as recent interactive Fano techniques for stationary bandits. As a complement, we also discuss a special regime in which constant satisficing regret is again possible.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies satisficing regret in nonstationary K-armed bandits. It claims that in the realizable piecewise-stationary setting with L stationary segments, the minimax regret is Θ(L log T) whenever L ≥ 2. This stands in contrast to the stationary case L = 1, where constant (T-independent) satisficing regret is achievable under realizability. The lower bound is obtained via a novel Fano-type argument that employs a post-interaction reference construction; the paper also identifies a special regime in which constant regret remains possible.
Significance. If the lower-bound construction is valid, the result shows that even a minimal amount of nonstationarity forces logarithmic growth in satisficing regret, which is a sharp and practically relevant distinction from the stationary setting. The post-interaction Fano framework extends classical and interactive Fano techniques and may be reusable in other nonstationary interactive problems. The explicit contrast between L = 1 and L ≥ 2, together with the identification of a constant-regret regime, supplies a clean characterization of when satisficing regret can remain bounded.
major comments (1)
- [Fano-based framework / post-interaction reference construction] The central lower bound (abstract and main lower-bound section) rests on the post-interaction reference construction that is asserted to preserve the KL-divergence structure needed for an Ω(log T) Fano term per segment. If the reference is permitted to depend on the realized (policy-dependent) rewards, the effective per-hypothesis divergences can shrink, which would collapse the mutual-information lower bound and invalidate the claimed separation from the L = 1 case. Explicit verification that the reference keeps the relevant KL terms bounded independently of the policy is required for the Θ(L log T) claim to hold.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The main concern centers on whether our post-interaction reference construction preserves the necessary KL-divergence bounds independently of the policy. We address this point directly below and clarify the construction.
read point-by-point responses
-
Referee: [Fano-based framework / post-interaction reference construction] The central lower bound (abstract and main lower-bound section) rests on the post-interaction reference construction that is asserted to preserve the KL-divergence structure needed for an Ω(log T) Fano term per segment. If the reference is permitted to depend on the realized (policy-dependent) rewards, the effective per-hypothesis divergences can shrink, which would collapse the mutual-information lower bound and invalidate the claimed separation from the L = 1 case. Explicit verification that the reference keeps the relevant KL terms bounded independently of the policy is required for the Θ(L log T) claim to hold.
Authors: We appreciate the referee's focus on this technical detail. The post-interaction reference is constructed from the segment-wise hypothesis distributions alone and does not depend on the realized rewards or the policy's action sequence. Formally, for each segment the reference measure is the mixture of the two candidate reward distributions under the fixed (non-policy-dependent) segment length, which ensures that the pairwise KL divergence remains bounded below by a positive constant independent of the observed trajectory. This is established in the proof of Lemma 4.3, where the mutual-information term is lower-bounded by Ω(log T) per segment precisely because the reference is chosen before any interaction occurs. Consequently the Fano inequality yields the claimed Ω(L log T) lower bound and the separation from the L=1 case is preserved. To address the request for explicit verification we will add a short dedicated paragraph and a self-contained lemma statement in Section 4.2 of the revised manuscript. revision: partial
Circularity Check
No significant circularity; lower bound rests on independent Fano construction
full rationale
The claimed Θ(L log T) optimal regret for L≥2 follows from a novel post-interaction reference construction that extends classical Fano to the piecewise-stationary realizable setting. No equations, fitted parameters, or self-citations are shown that reduce the lower bound to a definition or input by construction. The information-theoretic argument is presented as independent of the upper-bound analysis and does not rely on self-referential definitions or renamings. This is the normal non-circular outcome for a paper whose central result is an external lower-bound argument.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Realizability: reward distributions belong to a known class
- domain assumption Piecewise-stationary environment with exactly L stationary segments
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.