Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity
Pith reviewed 2026-05-21 07:30 UTC · model grok-4.3
The pith
Low-rank structure lets scalar-reward bandits recover a drifting subspace when noise variance is known, state-noise coupling bounded, and probes span full dimension.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
With single-play scalar rewards the moving subspace is recoverable through quadratic functionals of rewards if and only if three probe-side conditions hold: known noise variance, bounded state-noise coupling, and full-dimensional probe support. Each condition is necessary in the unrestricted-second-moment problem, and jointly they are sufficient. The SPSC algorithm interleaves isotropic probes with windowed projected ridge-UCB inside the learned r-dimensional subspace and uses a CUSUM-style detector to discover segment boundaries online, attaining costed dynamic regret of order r sqrt(T) plus T to the two-thirds plus a linear term in the number of changes.
What carries the argument
SPSC algorithm that interleaves isotropic probes with windowed projected ridge-UCB exploitation inside the learned r-dimensional subspace and a CUSUM-style detector for segment boundaries.
If this is right
- Dynamic regret scales with intrinsic rank r rather than ambient dimension d once the three identification conditions are met.
- A CUSUM-style detector can locate the unknown change points online without prior knowledge of their number or locations.
- The method outperforms both non-stationary linear bandits and stationary low-rank bandits on data sets where d minus r is at least order T to the one-sixth.
Where Pith is reading between the lines
- The same quadratic-functional recovery idea may apply to other drifting latent structures such as slowly changing user embeddings in sequential recommendation.
- If the three conditions can be enforced by design in a deployed system, the extra T to the two-thirds detection cost becomes the dominant remaining price of non-stationarity.
- Relaxing the piecewise-stationary assumption to allow gradual drift inside segments would require replacing the CUSUM detector with a different change-point statistic.
Load-bearing premise
The low-rank factor stays exactly constant inside each of a finite number of unknown segments and can change only at the boundaries between them.
What would settle it
Run the identification procedure on data where the noise variance is left unknown; the subspace estimate should fail to converge to the true moving subspace.
Figures
read the original abstract
Many bandit deployments (recommendation, clinical dosing, ad targeting) share two facts prior work handles only in isolation: rewards live on a low-dimensional latent subspace, and that subspace drifts. Stationary low-rank bandits exploit rank but break under subspace change; non-stationary linear bandits adapt to drift but pay ambient rate $\widetilde{O}(d\sqrt{T})$. We study piecewise-stationary low-rank linear contextual bandits with scalar feedback: $\theta_t = B_k^\star w_t$ with rank-$r$ factor $B_k^\star\in\mathbb{R}^{d\times r}$ constant within each of $K$ unknown segments and able to shift at boundaries. Our results are tight along three axes. (i) Identification boundary. With single-play scalar rewards, the moving subspace is recoverable through quadratic functionals of rewards iff three probe-side conditions hold: known noise variance, bounded state-noise coupling, and full-dimensional probe support. Each is necessary in the unrestricted-second-moment problem, and jointly they are sufficient, characterizing the boundary of the solvable region. (ii) Algorithm and dynamic regret. SPSC interleaves isotropic probes with windowed projected ridge-UCB exploitation inside the learned $r$-dimensional subspace; a CUSUM-style variant discovers segment boundaries online. The costed dynamic regret is $\widetilde{O}(r\sqrt{T})+\widetilde{O}(T^{2/3})+O(W\,V_{\mathrm{in}})$, replacing the ambient $d\sqrt{T}$ rate with the intrinsic rank. (iii) Empirics. On eleven benchmarks spanning synthetic, UCI/MovieLens, semi-synthetic clinical, and ZOZOTOWN production-log data, SPSC outperforms non-stationary and low-rank baselines whenever $d-r\gtrsim T^{1/6}$, matching the analytical crossover. To our knowledge, this is the first work to characterize the identification boundary and attain the intrinsic-rank dynamic-regret rate in this setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies piecewise-stationary low-rank linear contextual bandits with scalar rewards, where the rank-r factor B_k^* is constant within each of K unknown segments and changes only at boundaries. It claims to characterize an identification boundary: the moving subspace is recoverable via quadratic functionals of rewards if and only if three probe-side conditions hold (known noise variance, bounded state-noise coupling, full-dimensional probe support), with necessity shown in the unrestricted-second-moment problem and sufficiency in the bandit setting. The SPSC algorithm interleaves isotropic probes with windowed projected ridge-UCB exploitation and a CUSUM-style detector, attaining costed dynamic regret Õ(r√T) + Õ(T^{2/3}) + O(W V_in). Empirical results on eleven benchmarks show outperformance when d-r ≳ T^{1/6}.
Significance. If the identification boundary is fully characterized for the adaptive single-play setting and the regret bound is tight, the work would be significant for bridging low-rank and non-stationary bandits, replacing ambient d√T rates with intrinsic r dependence. The explicit dependence on r, T, and change cost, together with broad empirical validation across synthetic, UCI, clinical, and production data, would strengthen applicability to recommendation and dosing tasks. The attempt to delineate the solvable region via necessary and sufficient conditions is a positive feature.
major comments (2)
- [Abstract / problem formulation] Abstract and problem formulation section: the necessity of the three probe-side conditions (known noise variance, bounded state-noise coupling, full-dimensional probe support) is established only for the unrestricted-second-moment problem. It is unclear whether the counter-examples remain valid under the algorithm's adaptive, history-dependent single-play probes in the linear contextual bandit model, where probe second-moment matrices must be realizable by the learner's own choices. This directly affects whether the three conditions jointly characterize the boundary of the solvable region in the actual setting studied.
- [Algorithm and dynamic regret] Regret analysis (implied in the dynamic regret claim): the bound Õ(r√T) + Õ(T^{2/3}) + O(W V_in) replaces the ambient d√T rate, but the derivation of the T^{2/3} term from the CUSUM-style detector and the windowed exploitation should be checked for dependence on the identification conditions; if the necessity gap above is unresolved, the tightness claim along the identification axis is weakened.
minor comments (1)
- [Problem formulation] Notation for the change cost V_in and window parameter W should be defined explicitly at first use in the problem formulation to avoid ambiguity with standard non-stationary bandit notation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below with clarifications on the identification results and regret analysis, and propose targeted revisions for improved clarity.
read point-by-point responses
-
Referee: [Abstract / problem formulation] Abstract and problem formulation section: the necessity of the three probe-side conditions (known noise variance, bounded state-noise coupling, full-dimensional probe support) is established only for the unrestricted-second-moment problem. It is unclear whether the counter-examples remain valid under the algorithm's adaptive, history-dependent single-play probes in the linear contextual bandit model, where probe second-moment matrices must be realizable by the learner's own choices. This directly affects whether the three conditions jointly characterize the boundary of the solvable region in the actual setting studied.
Authors: We appreciate this observation. The necessity results are derived in the unrestricted-second-moment problem, which allows arbitrary probe second-moment matrices. The counterexamples therein demonstrate that without the three conditions, the subspace cannot be identified even when the learner has full freedom to choose any second-moment matrix. Consequently, these counterexamples remain valid—and in fact are strengthened—in the more restricted adaptive single-play setting, where only certain second-moment matrices are realizable through the learner's choices. Therefore, the conditions characterize the identification boundary for the bandit model as well. To make this explicit, we will add a clarifying remark in the revised manuscript. revision: partial
-
Referee: [Algorithm and dynamic regret] Regret analysis (implied in the dynamic regret claim): the bound Õ(r√T) + Õ(T^{2/3}) + O(W V_in) replaces the ambient d√T rate, but the derivation of the T^{2/3} term from the CUSUM-style detector and the windowed exploitation should be checked for dependence on the identification conditions; if the necessity gap above is unresolved, the tightness claim along the identification axis is weakened.
Authors: The regret bound is established assuming the identification conditions hold, which is justified by the sufficiency result in the bandit setting. The Õ(r√T) term reflects the low-rank structure in the exploitation phase after subspace recovery, while the Õ(T^{2/3}) term originates from the change detection delay of the CUSUM detector and the length of the sliding windows used for local estimation; these components depend on the detection accuracy enabled by the conditions but not on the necessity proof. With the clarification above resolving the necessity aspect, the tightness claim holds along the identification axis. We will expand the regret analysis section to include an explicit dependence breakdown. revision: partial
Circularity Check
No significant circularity; derivation self-contained with explicit problem distinctions
full rationale
The paper's central identification claim explicitly distinguishes the necessity result as holding only 'in the unrestricted-second-moment problem' while sufficiency is developed inside the adaptive single-play bandit model. No step reduces a prediction or boundary characterization to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation whose validity is internal to the present work. The three probe-side conditions are introduced as external modeling assumptions rather than derived from the algorithm's outputs. The regret bound is obtained from the algorithm analysis rather than defined tautologically. This is the normal case of a self-contained derivation against stated assumptions.
Axiom & Free-Parameter Ledger
free parameters (2)
- rank r
- number of segments K
axioms (3)
- domain assumption Known noise variance
- domain assumption Bounded state-noise coupling
- domain assumption Full-dimensional probe support
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
With single-play scalar rewards, the moving subspace is recoverable through quadratic functionals of rewards iff three probe-side conditions hold: known noise variance, bounded state-noise coupling, and full-dimensional probe support.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leancostAlphaLog_high_calibrated_iff unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Lifted probe sample Gt:=K^{-1}(st ut ut⊤), E[Gt|Ht−1]=Mt̃ + Bt̃ with bias Bt̃=−(δσ/d)Id
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
New England Journal of Medicine , volume=
Estimation of the warfarin dose with clinical and pharmacogenetic data , author=. New England Journal of Medicine , volume=. 2009 , publisher=
work page 2009
-
[2]
Advances in neural information processing systems , volume=
Improved algorithms for linear stochastic bandits , author=. Advances in neural information processing systems , volume=
-
[3]
International Conference on Machine Learning , pages=
Bilinear bandits with low-rank structure , author=. International Conference on Machine Learning , pages=. 2019 , organization=
work page 2019
-
[4]
International conference on artificial intelligence and statistics , pages=
Low-rank generalized linear bandit problems , author=. International conference on artificial intelligence and statistics , pages=. 2021 , organization=
work page 2021
-
[5]
The 22nd International Conference on Artificial Intelligence and Statistics , pages=
Learning to optimize under non-stationarity , author=. The 22nd International Conference on Artificial Intelligence and Statistics , pages=. 2019 , organization=
work page 2019
-
[6]
Advances in Neural Information Processing Systems , volume=
Weighted linear bandits for non-stationary environments , author=. Advances in Neural Information Processing Systems , volume=
-
[7]
Journal of Machine Learning Research , volume=
A new look at dynamic regret for non-stationary stochastic bandits , author=. Journal of Machine Learning Research , volume=
-
[8]
International Conference on Machine Learning , pages=
Prediction with limited advice and multiarmed bandits with paid observations , author=. International Conference on Machine Learning , pages=. 2014 , organization=
work page 2014
-
[9]
Uncertainty in Artificial Intelligence , pages=
Bandits with costly reward observations , author=. Uncertainty in Artificial Intelligence , pages=. 2023 , organization=
work page 2023
-
[10]
IEEE Transactions on Information Theory , volume=
Multi-armed bandits with costly probes , author=. IEEE Transactions on Information Theory , volume=. 2024 , publisher=
work page 2024
-
[11]
Advances in Neural Information Processing Systems , volume=
Spectral entry-wise matrix estimation for low-rank reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=
-
[12]
Advances in Neural Information Processing Systems , volume=
Almost minimax optimal best arm identification in piecewise stationary linear bandits , author=. Advances in Neural Information Processing Systems , volume=
-
[13]
Advances in Neural Information Processing Systems , volume=
Efficient frameworks for generalized low-rank matrix bandit problems , author=. Advances in Neural Information Processing Systems , volume=
-
[14]
Journal of Machine Learning Research , volume=
Finite-time analysis of globally nonstationary multi-armed bandits , author=. Journal of Machine Learning Research , volume=
-
[15]
Proceedings of the 41st International Conference on Machine Learning , pages=
Efficient low-rank matrix estimation, experimental design, and arm-set-dependent low-rank bandits , author=. Proceedings of the 41st International Conference on Machine Learning , pages=
-
[16]
International Conference on Machine Learning , pages=
Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery , author=. International Conference on Machine Learning , pages=. 2024 , organization=
work page 2024
-
[17]
Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation , author=. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , year=
-
[18]
American journal of health-system pharmacy , volume=
Therapeutic monitoring of vancomycin for serious methicillin-resistant Staphylococcus aureus infections: a revised consensus guideline and review by the American Society of Health-System Pharmacists, the Infectious Diseases Society of America, the Pediatric Infectious Diseases Society, and the Society of Infectious Diseases Pharmacists , author=. American...
work page 2020
-
[19]
Electronic Communications in Probability , volume=
Freedman's inequality for matrix martingales , author=. Electronic Communications in Probability , volume=
-
[20]
Advances in Neural Information Processing Systems , volume=
Beyond task diversity: provable representation transfer for sequential multitask linear bandits , author=. Advances in Neural Information Processing Systems , volume=
-
[21]
arXiv preprint arXiv:2509.05460 , year=
Calibrated Recommendations with Contextual Bandits , author=. arXiv preprint arXiv:2509.05460 , year=
-
[22]
Change surface regression for nonlinear subgroup identification with application to warfarin pharmacogenomics data , author=. Biometrics , volume=. 2025 , publisher=
work page 2025
- [23]
-
[24]
Adaptive estimation of a quadratic functional by model selection , author=. Annals of statistics , pages=. 2000 , publisher=
work page 2000
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.