pith. sign in

arxiv: 2605.20269 · v1 · pith:E36GF3ZMnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI· stat.ML

Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity

Pith reviewed 2026-05-21 07:30 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML
keywords low-rank banditsnon-stationary banditsdynamic regretsubspace identificationpiecewise stationarycontextual banditschange point detection
4
0 comments X

The pith

Low-rank structure lets scalar-reward bandits recover a drifting subspace when noise variance is known, state-noise coupling bounded, and probes span full dimension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in piecewise-stationary low-rank linear contextual bandits, a moving r-dimensional subspace can be recovered from single-play scalar rewards through quadratic functionals precisely when three probe conditions hold simultaneously. It shows each condition is necessary and that all three together are sufficient for identification. An algorithm called SPSC exploits this by interleaving isotropic probes with projected ridge-UCB inside the recovered subspace while using a CUSUM detector to locate the unknown change points. The resulting dynamic regret replaces the usual ambient-dimension scaling with a dependence on rank r, at the price of an extra T to the two-thirds term from change-point detection.

Core claim

With single-play scalar rewards the moving subspace is recoverable through quadratic functionals of rewards if and only if three probe-side conditions hold: known noise variance, bounded state-noise coupling, and full-dimensional probe support. Each condition is necessary in the unrestricted-second-moment problem, and jointly they are sufficient. The SPSC algorithm interleaves isotropic probes with windowed projected ridge-UCB inside the learned r-dimensional subspace and uses a CUSUM-style detector to discover segment boundaries online, attaining costed dynamic regret of order r sqrt(T) plus T to the two-thirds plus a linear term in the number of changes.

What carries the argument

SPSC algorithm that interleaves isotropic probes with windowed projected ridge-UCB exploitation inside the learned r-dimensional subspace and a CUSUM-style detector for segment boundaries.

If this is right

  • Dynamic regret scales with intrinsic rank r rather than ambient dimension d once the three identification conditions are met.
  • A CUSUM-style detector can locate the unknown change points online without prior knowledge of their number or locations.
  • The method outperforms both non-stationary linear bandits and stationary low-rank bandits on data sets where d minus r is at least order T to the one-sixth.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same quadratic-functional recovery idea may apply to other drifting latent structures such as slowly changing user embeddings in sequential recommendation.
  • If the three conditions can be enforced by design in a deployed system, the extra T to the two-thirds detection cost becomes the dominant remaining price of non-stationarity.
  • Relaxing the piecewise-stationary assumption to allow gradual drift inside segments would require replacing the CUSUM detector with a different change-point statistic.

Load-bearing premise

The low-rank factor stays exactly constant inside each of a finite number of unknown segments and can change only at the boundaries between them.

What would settle it

Run the identification procedure on data where the noise variance is left unknown; the subspace estimate should fail to converge to the true moving subspace.

Figures

Figures reproduced from arXiv: 2605.20269 by Hamed Khosravi, Xiaoming Huo.

Figure 1
Figure 1. Figure 1: Phase-transition boundary. (a) Empirical ratio=1 contour (solid) closely tracks the theoretical crossover d − r = T 1/6 (dashed); markers indicate SPSC wins (blue circles) and LinUCB wins (red triangles). (b) Ratio versus d for each r (log scale): empirical ratios (solid) match the p r/d rate predicted by Theorem 4.1 (dotted). SPSC achieves p 16–29% reductions whenever d ≥ 45 and r ≥ 3 with every r-curve t… view at source ↗
Figure 2
Figure 2. Figure 2: Pendigits operating-regime grid. SPSC and SPSC-Adaptive dominate every non-oracle baseline once d≥55 and r≥10. Synthetic (d, r) grid [PITH_FULL_IMAGE:figures/full_fig_p026_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Probe-rate ablation. Left: final regret vs. probe frequency. Right: final control [PITH_FULL_IMAGE:figures/full_fig_p028_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Rank misspecification robustness (Covertype, [PITH_FULL_IMAGE:figures/full_fig_p028_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Robustness and necessity experiments. (a) Variance misspecification, regret stable [PITH_FULL_IMAGE:figures/full_fig_p029_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Subspace recovery. ∥Pbk − P ⋆ k ∥2 vs. probe count matches the predicted 1/ √ m rate on a log-log scale [PITH_FULL_IMAGE:figures/full_fig_p030_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Change-point adaptation. Instantaneous regret drops sharply after each boundary as fresh probes rebuild the subspace. Change-point frequency. K ∈ {1, 2, 4, 6, 8, 12} at T=6,000. Ratio grows from 0.414 (long segments) to 0.855 (very short); SPSC retains an edge at all frequencies but the margin shrinks as the per-segment probe budget collapses below the identifiability threshold. Drift speed. LDS spectral r… view at source ↗
Figure 8
Figure 8. Figure 8: Noise robustness: SPSC advantage is noise-monotone. [PITH_FULL_IMAGE:figures/full_fig_p031_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Change-point frequency [PITH_FULL_IMAGE:figures/full_fig_p031_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Drift speed. Benefit gated by LDS correlation time [PITH_FULL_IMAGE:figures/full_fig_p031_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Small-d piecewise-stationary stress test of Russac et al. [2019] (d=2, r=1, K=4, T=6,000, 30 seeds). (a) Cumulative control regret; vertical dotted lines mark the three change points. (b) Final control regret per method; SPSC is the best non-oracle baseline. J Warfarin: random-subspace ablation The Warfarin gain reported in §5.3 admits two distinct explanations: projection of the 93-dimensional regression… view at source ↗
Figure 12
Figure 12. Figure 12: Warfarin random-subspace ablation. SPSC’s learned subspace vs. a fresh random rank-r projection at every segment. Half of the gap to LinUCB is dimensionality reduction; the other half is learned signal. K BOSS/Jedra adaptation to the piecewise setting Stationary low-rank methods exploit the same rank structure as SPSC but assume the sub￾space is fixed, so without segment-aware restarts they would fail tri… view at source ↗
Figure 13
Figure 13. Figure 13: SPSC Alg. 1 vs. SPSC-Adaptive under oracle misspecification. Final cumulative costed regret as the oracle-supplied number of segments Koracle varies, with Kreal=5 true subspace shifts (d=60, r=5, T=5,000, 20 seeds; mean ± SE). Alg. 1 is fed Koracle ∈ {5, 10, 20, 40, 80, 150, 200} evenly-spaced boundaries (extras beyond Kreal are false alarms) and degrades monotonically; Alg. 2 runs CUSUM on the clean stre… view at source ↗
read the original abstract

Many bandit deployments (recommendation, clinical dosing, ad targeting) share two facts prior work handles only in isolation: rewards live on a low-dimensional latent subspace, and that subspace drifts. Stationary low-rank bandits exploit rank but break under subspace change; non-stationary linear bandits adapt to drift but pay ambient rate $\widetilde{O}(d\sqrt{T})$. We study piecewise-stationary low-rank linear contextual bandits with scalar feedback: $\theta_t = B_k^\star w_t$ with rank-$r$ factor $B_k^\star\in\mathbb{R}^{d\times r}$ constant within each of $K$ unknown segments and able to shift at boundaries. Our results are tight along three axes. (i) Identification boundary. With single-play scalar rewards, the moving subspace is recoverable through quadratic functionals of rewards iff three probe-side conditions hold: known noise variance, bounded state-noise coupling, and full-dimensional probe support. Each is necessary in the unrestricted-second-moment problem, and jointly they are sufficient, characterizing the boundary of the solvable region. (ii) Algorithm and dynamic regret. SPSC interleaves isotropic probes with windowed projected ridge-UCB exploitation inside the learned $r$-dimensional subspace; a CUSUM-style variant discovers segment boundaries online. The costed dynamic regret is $\widetilde{O}(r\sqrt{T})+\widetilde{O}(T^{2/3})+O(W\,V_{\mathrm{in}})$, replacing the ambient $d\sqrt{T}$ rate with the intrinsic rank. (iii) Empirics. On eleven benchmarks spanning synthetic, UCI/MovieLens, semi-synthetic clinical, and ZOZOTOWN production-log data, SPSC outperforms non-stationary and low-rank baselines whenever $d-r\gtrsim T^{1/6}$, matching the analytical crossover. To our knowledge, this is the first work to characterize the identification boundary and attain the intrinsic-rank dynamic-regret rate in this setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper studies piecewise-stationary low-rank linear contextual bandits with scalar rewards, where the rank-r factor B_k^* is constant within each of K unknown segments and changes only at boundaries. It claims to characterize an identification boundary: the moving subspace is recoverable via quadratic functionals of rewards if and only if three probe-side conditions hold (known noise variance, bounded state-noise coupling, full-dimensional probe support), with necessity shown in the unrestricted-second-moment problem and sufficiency in the bandit setting. The SPSC algorithm interleaves isotropic probes with windowed projected ridge-UCB exploitation and a CUSUM-style detector, attaining costed dynamic regret Õ(r√T) + Õ(T^{2/3}) + O(W V_in). Empirical results on eleven benchmarks show outperformance when d-r ≳ T^{1/6}.

Significance. If the identification boundary is fully characterized for the adaptive single-play setting and the regret bound is tight, the work would be significant for bridging low-rank and non-stationary bandits, replacing ambient d√T rates with intrinsic r dependence. The explicit dependence on r, T, and change cost, together with broad empirical validation across synthetic, UCI, clinical, and production data, would strengthen applicability to recommendation and dosing tasks. The attempt to delineate the solvable region via necessary and sufficient conditions is a positive feature.

major comments (2)
  1. [Abstract / problem formulation] Abstract and problem formulation section: the necessity of the three probe-side conditions (known noise variance, bounded state-noise coupling, full-dimensional probe support) is established only for the unrestricted-second-moment problem. It is unclear whether the counter-examples remain valid under the algorithm's adaptive, history-dependent single-play probes in the linear contextual bandit model, where probe second-moment matrices must be realizable by the learner's own choices. This directly affects whether the three conditions jointly characterize the boundary of the solvable region in the actual setting studied.
  2. [Algorithm and dynamic regret] Regret analysis (implied in the dynamic regret claim): the bound Õ(r√T) + Õ(T^{2/3}) + O(W V_in) replaces the ambient d√T rate, but the derivation of the T^{2/3} term from the CUSUM-style detector and the windowed exploitation should be checked for dependence on the identification conditions; if the necessity gap above is unresolved, the tightness claim along the identification axis is weakened.
minor comments (1)
  1. [Problem formulation] Notation for the change cost V_in and window parameter W should be defined explicitly at first use in the problem formulation to avoid ambiguity with standard non-stationary bandit notation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below with clarifications on the identification results and regret analysis, and propose targeted revisions for improved clarity.

read point-by-point responses
  1. Referee: [Abstract / problem formulation] Abstract and problem formulation section: the necessity of the three probe-side conditions (known noise variance, bounded state-noise coupling, full-dimensional probe support) is established only for the unrestricted-second-moment problem. It is unclear whether the counter-examples remain valid under the algorithm's adaptive, history-dependent single-play probes in the linear contextual bandit model, where probe second-moment matrices must be realizable by the learner's own choices. This directly affects whether the three conditions jointly characterize the boundary of the solvable region in the actual setting studied.

    Authors: We appreciate this observation. The necessity results are derived in the unrestricted-second-moment problem, which allows arbitrary probe second-moment matrices. The counterexamples therein demonstrate that without the three conditions, the subspace cannot be identified even when the learner has full freedom to choose any second-moment matrix. Consequently, these counterexamples remain valid—and in fact are strengthened—in the more restricted adaptive single-play setting, where only certain second-moment matrices are realizable through the learner's choices. Therefore, the conditions characterize the identification boundary for the bandit model as well. To make this explicit, we will add a clarifying remark in the revised manuscript. revision: partial

  2. Referee: [Algorithm and dynamic regret] Regret analysis (implied in the dynamic regret claim): the bound Õ(r√T) + Õ(T^{2/3}) + O(W V_in) replaces the ambient d√T rate, but the derivation of the T^{2/3} term from the CUSUM-style detector and the windowed exploitation should be checked for dependence on the identification conditions; if the necessity gap above is unresolved, the tightness claim along the identification axis is weakened.

    Authors: The regret bound is established assuming the identification conditions hold, which is justified by the sufficiency result in the bandit setting. The Õ(r√T) term reflects the low-rank structure in the exploitation phase after subspace recovery, while the Õ(T^{2/3}) term originates from the change detection delay of the CUSUM detector and the length of the sliding windows used for local estimation; these components depend on the detection accuracy enabled by the conditions but not on the necessity proof. With the clarification above resolving the necessity aspect, the tightness claim holds along the identification axis. We will expand the regret analysis section to include an explicit dependence breakdown. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained with explicit problem distinctions

full rationale

The paper's central identification claim explicitly distinguishes the necessity result as holding only 'in the unrestricted-second-moment problem' while sufficiency is developed inside the adaptive single-play bandit model. No step reduces a prediction or boundary characterization to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation whose validity is internal to the present work. The three probe-side conditions are introduced as external modeling assumptions rather than derived from the algorithm's outputs. The regret bound is obtained from the algorithm analysis rather than defined tautologically. This is the normal case of a self-contained derivation against stated assumptions.

Axiom & Free-Parameter Ledger

2 free parameters · 3 axioms · 0 invented entities

The central claims rest on three explicit probe-side conditions plus the piecewise-stationary model; these are stated as necessary and jointly sufficient rather than fitted.

free parameters (2)
  • rank r
    Assumed known or selected; the regret scales with r but the paper does not detail how r is chosen from data.
  • number of segments K
    Unknown a priori; detected online via CUSUM-style procedure.
axioms (3)
  • domain assumption Known noise variance
    Required for the quadratic-functional recovery of the moving subspace; stated as one of the three necessary conditions.
  • domain assumption Bounded state-noise coupling
    One of the three probe-side conditions necessary for identification.
  • domain assumption Full-dimensional probe support
    Third necessary condition for recovering the subspace from scalar rewards.

pith-pipeline@v0.9.0 · 5890 in / 1570 out tokens · 42461 ms · 2026-05-21T07:30:57.973208+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

  1. [1]

    New England Journal of Medicine , volume=

    Estimation of the warfarin dose with clinical and pharmacogenetic data , author=. New England Journal of Medicine , volume=. 2009 , publisher=

  2. [2]

    Advances in neural information processing systems , volume=

    Improved algorithms for linear stochastic bandits , author=. Advances in neural information processing systems , volume=

  3. [3]

    International Conference on Machine Learning , pages=

    Bilinear bandits with low-rank structure , author=. International Conference on Machine Learning , pages=. 2019 , organization=

  4. [4]

    International conference on artificial intelligence and statistics , pages=

    Low-rank generalized linear bandit problems , author=. International conference on artificial intelligence and statistics , pages=. 2021 , organization=

  5. [5]

    The 22nd International Conference on Artificial Intelligence and Statistics , pages=

    Learning to optimize under non-stationarity , author=. The 22nd International Conference on Artificial Intelligence and Statistics , pages=. 2019 , organization=

  6. [6]

    Advances in Neural Information Processing Systems , volume=

    Weighted linear bandits for non-stationary environments , author=. Advances in Neural Information Processing Systems , volume=

  7. [7]

    Journal of Machine Learning Research , volume=

    A new look at dynamic regret for non-stationary stochastic bandits , author=. Journal of Machine Learning Research , volume=

  8. [8]

    International Conference on Machine Learning , pages=

    Prediction with limited advice and multiarmed bandits with paid observations , author=. International Conference on Machine Learning , pages=. 2014 , organization=

  9. [9]

    Uncertainty in Artificial Intelligence , pages=

    Bandits with costly reward observations , author=. Uncertainty in Artificial Intelligence , pages=. 2023 , organization=

  10. [10]

    IEEE Transactions on Information Theory , volume=

    Multi-armed bandits with costly probes , author=. IEEE Transactions on Information Theory , volume=. 2024 , publisher=

  11. [11]

    Advances in Neural Information Processing Systems , volume=

    Spectral entry-wise matrix estimation for low-rank reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

  12. [12]

    Advances in Neural Information Processing Systems , volume=

    Almost minimax optimal best arm identification in piecewise stationary linear bandits , author=. Advances in Neural Information Processing Systems , volume=

  13. [13]

    Advances in Neural Information Processing Systems , volume=

    Efficient frameworks for generalized low-rank matrix bandit problems , author=. Advances in Neural Information Processing Systems , volume=

  14. [14]

    Journal of Machine Learning Research , volume=

    Finite-time analysis of globally nonstationary multi-armed bandits , author=. Journal of Machine Learning Research , volume=

  15. [15]

    Proceedings of the 41st International Conference on Machine Learning , pages=

    Efficient low-rank matrix estimation, experimental design, and arm-set-dependent low-rank bandits , author=. Proceedings of the 41st International Conference on Machine Learning , pages=

  16. [16]

    International Conference on Machine Learning , pages=

    Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery , author=. International Conference on Machine Learning , pages=. 2024 , organization=

  17. [17]

    Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , year=

    Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation , author=. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , year=

  18. [18]

    American journal of health-system pharmacy , volume=

    Therapeutic monitoring of vancomycin for serious methicillin-resistant Staphylococcus aureus infections: a revised consensus guideline and review by the American Society of Health-System Pharmacists, the Infectious Diseases Society of America, the Pediatric Infectious Diseases Society, and the Society of Infectious Diseases Pharmacists , author=. American...

  19. [19]

    Electronic Communications in Probability , volume=

    Freedman's inequality for matrix martingales , author=. Electronic Communications in Probability , volume=

  20. [20]

    Advances in Neural Information Processing Systems , volume=

    Beyond task diversity: provable representation transfer for sequential multitask linear bandits , author=. Advances in Neural Information Processing Systems , volume=

  21. [21]

    arXiv preprint arXiv:2509.05460 , year=

    Calibrated Recommendations with Contextual Bandits , author=. arXiv preprint arXiv:2509.05460 , year=

  22. [22]

    Biometrics , volume=

    Change surface regression for nonlinear subgroup identification with application to warfarin pharmacogenomics data , author=. Biometrics , volume=. 2025 , publisher=

  23. [23]

    2013 , publisher=

    Matrix analysis , author=. 2013 , publisher=

  24. [24]

    Annals of statistics , pages=

    Adaptive estimation of a quadratic functional by model selection , author=. Annals of statistics , pages=. 2000 , publisher=