On the Peril of (Even a Little) Nonstationarity in Satisficing Regret Minimization

Qiaomin Xie; Ruihao Zhu; Yixuan Zhang

arxiv: 2603.18514 · v2 · submitted 2026-03-19 · 📊 stat.ML · cs.LG

On the Peril of (Even a Little) Nonstationarity in Satisficing Regret Minimization

Yixuan Zhang , Ruihao Zhu , Qiaomin Xie This is my paper

Pith reviewed 2026-05-15 09:02 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords nonstationary banditssatisficing regretpiecewise stationaryFano inequalityregret lower boundsrealizable setting

0 comments

The pith

In realizable piecewise-stationary bandits with L at least 2 segments, the best satisficing regret grows as Theta of L log T.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that satisficing regret in K-armed bandits cannot stay bounded once the environment has even two stationary segments under the general realizable model. In the fully stationary case a constant regret bound is possible, but any additional segment forces the lower bound to scale with the horizon T through a logarithmic factor per segment. This lower bound is obtained by extending Fano's inequality to the nonstationary interactive setting via a post-interaction reference construction that compares the learner's observations across segments.

Core claim

In the general realizable, piecewise-stationary setting with L stationary segments, the optimal satisficing regret is Theta(L log T) whenever L is at least 2, whereas the stationary case L equals 1 permits a T-independent constant regret under the same realizability assumption.

What carries the argument

A Fano-based lower-bound framework that constructs a post-interaction reference distribution to handle the nonstationary piecewise-stationary case.

If this is right

Satisficing regret must grow logarithmically with time whenever the number of stationary segments is at least two.
Any algorithm that aims for bounded satisficing regret must fail on some realizable instance once a single change point appears.
The contrast with the L equals 1 case shows that even minimal nonstationarity qualitatively changes the achievable guarantee.
A special regime identified in the paper still permits constant regret when additional structure restricts the possible changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result suggests that change-point detection or adaptation carries an unavoidable logarithmic cost in the satisficing metric even when the model is realizable.
It would be natural to ask whether known change locations or restricted families of segment distributions restore constant regret outside the special regime already noted.

Load-bearing premise

The environment is realizable and changes exactly L times with L at least 2, and a post-interaction reference distribution exists that lets the Fano argument go through.

What would settle it

An algorithm that achieves o(L log T) satisficing regret on every realizable piecewise-stationary instance with two segments would falsify the claimed lower bound.

read the original abstract

Motivated by the principle of satisficing in decision-making, we study satisficing regret guarantees for nonstationary $K$-armed bandits. We show that in the general realizable, piecewise-stationary setting with $L$ stationary segments, the optimal regret is $\Theta(L\log T)$ as long as $L\geq 2$. This stands in sharp contrast to the case of $L=1$ (i.e., the stationary setting), where a $T$-independent $\Theta(1)$ satisficing regret is achievable under realizability. In other words, the optimal regret has to scale with $T$ even if just a little nonstationarity presents. A key ingredient in our analysis is a novel Fano-based framework tailored to nonstationary bandits via a \emph{post-interaction reference} construction. This framework strictly extends the classical Fano method for passive estimation as well as recent interactive Fano techniques for stationary bandits. As a complement, we also discuss a special regime in which constant satisficing regret is again possible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript studies satisficing regret in nonstationary K-armed bandits. It claims that in the realizable piecewise-stationary setting with L stationary segments, the minimax regret is Θ(L log T) whenever L ≥ 2. This stands in contrast to the stationary case L = 1, where constant (T-independent) satisficing regret is achievable under realizability. The lower bound is obtained via a novel Fano-type argument that employs a post-interaction reference construction; the paper also identifies a special regime in which constant regret remains possible.

Significance. If the lower-bound construction is valid, the result shows that even a minimal amount of nonstationarity forces logarithmic growth in satisficing regret, which is a sharp and practically relevant distinction from the stationary setting. The post-interaction Fano framework extends classical and interactive Fano techniques and may be reusable in other nonstationary interactive problems. The explicit contrast between L = 1 and L ≥ 2, together with the identification of a constant-regret regime, supplies a clean characterization of when satisficing regret can remain bounded.

major comments (1)

[Fano-based framework / post-interaction reference construction] The central lower bound (abstract and main lower-bound section) rests on the post-interaction reference construction that is asserted to preserve the KL-divergence structure needed for an Ω(log T) Fano term per segment. If the reference is permitted to depend on the realized (policy-dependent) rewards, the effective per-hypothesis divergences can shrink, which would collapse the mutual-information lower bound and invalidate the claimed separation from the L = 1 case. Explicit verification that the reference keeps the relevant KL terms bounded independently of the policy is required for the Θ(L log T) claim to hold.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review. The main concern centers on whether our post-interaction reference construction preserves the necessary KL-divergence bounds independently of the policy. We address this point directly below and clarify the construction.

read point-by-point responses

Referee: [Fano-based framework / post-interaction reference construction] The central lower bound (abstract and main lower-bound section) rests on the post-interaction reference construction that is asserted to preserve the KL-divergence structure needed for an Ω(log T) Fano term per segment. If the reference is permitted to depend on the realized (policy-dependent) rewards, the effective per-hypothesis divergences can shrink, which would collapse the mutual-information lower bound and invalidate the claimed separation from the L = 1 case. Explicit verification that the reference keeps the relevant KL terms bounded independently of the policy is required for the Θ(L log T) claim to hold.

Authors: We appreciate the referee's focus on this technical detail. The post-interaction reference is constructed from the segment-wise hypothesis distributions alone and does not depend on the realized rewards or the policy's action sequence. Formally, for each segment the reference measure is the mixture of the two candidate reward distributions under the fixed (non-policy-dependent) segment length, which ensures that the pairwise KL divergence remains bounded below by a positive constant independent of the observed trajectory. This is established in the proof of Lemma 4.3, where the mutual-information term is lower-bounded by Ω(log T) per segment precisely because the reference is chosen before any interaction occurs. Consequently the Fano inequality yields the claimed Ω(L log T) lower bound and the separation from the L=1 case is preserved. To address the request for explicit verification we will add a short dedicated paragraph and a self-contained lemma statement in Section 4.2 of the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No significant circularity; lower bound rests on independent Fano construction

full rationale

The claimed Θ(L log T) optimal regret for L≥2 follows from a novel post-interaction reference construction that extends classical Fano to the piecewise-stationary realizable setting. No equations, fitted parameters, or self-citations are shown that reduce the lower bound to a definition or input by construction. The information-theoretic argument is presented as independent of the upper-bound analysis and does not rely on self-referential definitions or renamings. This is the normal non-circular outcome for a paper whose central result is an external lower-bound argument.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the realizability assumption and the piecewise-stationary model with a fixed number of segments; no free parameters or new invented entities are mentioned in the abstract.

axioms (2)

domain assumption Realizability: reward distributions belong to a known class
Invoked to obtain the constant-regret result for L=1 and the contrast for L>=2
domain assumption Piecewise-stationary environment with exactly L stationary segments
Defines the nonstationary setting in which the Theta(L log T) bound is claimed

pith-pipeline@v0.9.0 · 5491 in / 1369 out tokens · 44362 ms · 2026-05-15T09:02:52.025698+00:00 · methodology

On the Peril of (Even a Little) Nonstationarity in Satisficing Regret Minimization

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)