Direction-Aware Offline-to-Online Learning in Linear Contextual Bandits
Pith reviewed 2026-05-08 04:37 UTC · model grok-4.3
The pith
A directional bias certificate lets linear bandits safely exploit biased offline data without dimension-dependent regret.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the directional bias certificate (M_bias, ρ) that quantifies the offline-to-online gap through an M_bias-induced norm and assigns different bias budgets to different directions. Building on this certificate we propose Ellipsoidal-MINUCB, which augments the online learner with an offline-pooled branch that safely exploits historical data. When the certificate is known the algorithm matches the SupLinUCB regret rate in the worst case and improves whenever offline coverage aligns with low-bias directions. When the certificate is unknown we estimate it adaptively from offline and accumulated online data and prove a corresponding regret guarantee.
What carries the argument
The directional bias certificate (M_bias, ρ) that measures the offline-to-online gap via an M_bias-induced norm and allocates separate bias budgets to different feature directions.
If this is right
- The algorithm matches the standard SupLinUCB regret rate of order d sqrt(T) in the worst case when the certificate is known.
- Regret strictly improves over the standard rate whenever offline coverage aligns with low-bias directions.
- Adaptive estimation of the unknown certificate from offline and online data yields a corresponding regret bound.
- Numerical experiments confirm practical gains precisely in the aligned regimes predicted by the theory.
Where Pith is reading between the lines
- The same directional certificate idea could be tested in non-linear or kernelized bandits by defining an analogous norm on the reproducing kernel Hilbert space.
- In practice, one could pre-compute candidate M_bias matrices from historical logs to decide which directions are safe to exploit before deployment.
- The geometry of the feature space becomes a first-class object for transfer between offline and online regimes, suggesting new diagnostics that inspect per-direction coverage before running the bandit.
Load-bearing premise
The offline-to-online gap is directional and can be captured by an M_bias-induced norm certificate that permits safe exploitation without forcing dimension-dependent regret even when the certificate must be estimated from data.
What would settle it
Construct a linear bandit instance with known certificate where bias exists in only one unknown direction; if the algorithm still incurs regret scaling with the full dimension d rather than improving relative to standard SupLinUCB, the improvement claim is falsified.
Figures
read the original abstract
Many bandit systems are deployed with offline historical data, such as past logs from earlier policies. Using these data can reduce early online exploration when they remain informative for the online problem. When the offline and online environments differ, such data can be biased for the online problem. For linear (contextual) bandits, this bias is directional: offline data may be informative in some feature directions and misleading in others. However, prior work typically controls this gap through a known Euclidean bound on the model parameters, which we prove is too coarse: even with the offline parameter known, bias in a single unknown direction can force dimension-dependent regret. To address this challenge, we introduce a directional bias certificate $(M_{\mathrm{bias}},\rho)$ that measures the offline-to-online gap through an $M_{\mathrm{bias}}$-induced norm and assigns different bias budgets to different directions. Building on this certificate, we propose \emph{Ellipsoidal-MINUCB}, which augments the online learning with an offline-pooled branch that safely exploits historical data. When the certificate is known, we show that the algorithm matches the standard SupLinUCB rate in the worst case and improves when offline coverage aligns with low-bias directions. When the certificate is unknown, we estimate it adaptively from offline and accumulated online data and establish a corresponding regret guarantee. Numerical experiments support the theory and show gains in aligned regimes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a directional bias certificate (M_bias, ρ) that quantifies the offline-to-online gap in linear contextual bandits via an M_bias-induced norm, rather than a coarse Euclidean bound. It proposes the Ellipsoidal-MINUCB algorithm, which augments standard online learning with an offline-pooled branch for safe exploitation. When the certificate is known, the algorithm is shown to match the SupLinUCB regret rate in the worst case while improving in aligned regimes. When unknown, the certificate is estimated adaptively from offline and online data, with a corresponding regret guarantee claimed. Experiments illustrate gains in aligned settings.
Significance. If the regret bounds hold, particularly the adaptive estimation case, the work meaningfully advances offline-to-online bandit learning by replacing dimension-dependent worst-case analysis with a directional certificate that permits safe exploitation without forcing d-linear regret. The explicit matching to SupLinUCB and the adaptive estimation procedure are strengths; the directional norm idea addresses a genuine limitation of prior Euclidean-gap approaches.
major comments (2)
- Unknown-certificate regime (regret analysis section): the claimed regret bound must explicitly fold in a high-probability operator-norm or dual-norm bound on the estimation error of M_bias and ρ. If the ellipsoid construction only controls error in Frobenius norm, a single low-coverage direction can violate the safe-exploitation condition and reintroduce dimension dependence, exactly as the Euclidean case the authors criticize. The current sketch does not appear to close this gap.
- Theorem for known certificate (main regret theorem): the proof that the algorithm matches SupLinUCB in the worst case while improving under alignment relies on the certificate being exactly known; any looseness in how the M_bias-induced norm is used to shape the offline confidence ellipsoid must be quantified, otherwise the 'parameter-free' improvement claim reduces to a standard bound.
minor comments (2)
- Notation for the directional norm and the definition of the offline-pooled branch should be introduced with an explicit example (e.g., a 2-d case) to clarify how different bias budgets are assigned per direction.
- The experimental section would benefit from reporting the estimated M_bias matrices and the alignment metric used to generate the 'aligned' vs. 'misaligned' regimes, so readers can reproduce the directional effect.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address the two major comments point-by-point below. Both points identify areas where the current analysis sketch can be strengthened, and we will revise the manuscript accordingly.
read point-by-point responses
-
Referee: Unknown-certificate regime (regret analysis section): the claimed regret bound must explicitly fold in a high-probability operator-norm or dual-norm bound on the estimation error of M_bias and ρ. If the ellipsoid construction only controls error in Frobenius norm, a single low-coverage direction can violate the safe-exploitation condition and reintroduce dimension dependence, exactly as the Euclidean case the authors criticize. The current sketch does not appear to close this gap.
Authors: We agree that the estimation error must be controlled in the operator (or dual) norm to preserve directional safety. In the revision we will augment the unknown-certificate analysis with explicit high-probability bounds on ||M̂_bias − M_bias||_op and on the error in ρ, obtained via matrix concentration on the pooled offline-plus-online covariance. These bounds will be folded directly into the regret statement so that the safe-exploitation condition holds uniformly and dimension dependence is avoided. revision: yes
-
Referee: Theorem for known certificate (main regret theorem): the proof that the algorithm matches SupLinUCB in the worst case while improving under alignment relies on the certificate being exactly known; any looseness in how the M_bias-induced norm is used to shape the offline confidence ellipsoid must be quantified, otherwise the 'parameter-free' improvement claim reduces to a standard bound.
Authors: We thank the referee for this observation. While the theorem assumes exact knowledge of the certificate, the current proof sketch does not fully quantify the geometric effect of the M_bias-norm on the offline ellipsoid. In the revision we will insert a supporting lemma that bounds the difference between the M_bias-induced ellipsoid and its Euclidean counterpart, showing that the per-direction exploration cost improves by a factor depending only on ρ and the alignment of offline data, without hidden looseness or additional d-factors. The worst-case matching to SupLinUCB will be stated explicitly. revision: yes
Circularity Check
No significant circularity detected; regret analysis is self-contained relative to the new certificate.
full rationale
The paper introduces the directional bias certificate (M_bias, ρ) as a modeling device to capture offline-to-online gaps more finely than Euclidean bounds, proves that Euclidean bounds are coarse, and then derives regret guarantees for Ellipsoidal-MINUCB conditionally on the certificate (when known) or via its adaptive estimation (when unknown). These steps do not reduce the claimed bounds to tautologies by construction, nor do they rely on load-bearing self-citations, fitted inputs renamed as predictions, or ansatzes smuggled via prior work. The derivation augments standard SupLinUCB-style analysis with the certificate without the final regret expression collapsing to a redefinition of the certificate itself. This is the normal case of a paper whose central claims rest on independent theoretical content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Linear contextual bandit model with feature vectors and unknown parameter
invented entities (1)
-
Directional bias certificate (M_bias, ρ)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
the optimal action survives elimination at every layer
-
[2]
every arm surviving a narrow layer is near-optimal
-
[3]
stopping at a nonterminal layer certifies a large online width
-
[4]
In our wrapper these are supplied, respectively, by Lemmas 8, 9, 11, and 12
the layerwise online widths satisfy an elliptical-potential bound. In our wrapper these are supplied, respectively, by Lemmas 8, 9, 11, and 12. The proof of Theorem 1 in Chu et al. [2011] is purely pathwise once those four ingredients hold, so the same dyadic summation yieldsR T ≤C SL p T d L3 T onE. H Proof of Theorem 1 We now prove the pooled part of th...
work page 2011
-
[5]
Let Nj := TX t=1 1{at =j} be the number of pulls of armj
Let Pj denote the law of the interaction under instance j, and let P0 denote the null instance with parameterθ †, under which all arms have mean1/4. Let Nj := TX t=1 1{at =j} be the number of pulls of armj. Choosej ⋆ minimizingE 0[Nj]. SincePd−1 j=1 Nj =T, we have E0[Nj⋆]≤ T d−1 . Small-shift regime.Assume first that ρ≤ 1 8 r d−1 T . SinceT≥d 2 andd≥9, th...
-
[6]
Using the standard Bernoulli bound kl(p, q)≤(q−p) 2/(p(1−q)), kl 1 4 , 1 4 + ρ√ 2 ≤ (ρ/ √ 2)2 (1/4)(3/4−ρ/ √
-
[7]
≤4ρ 2. Since only pulls of armj ⋆ distinguishP 0 fromP j⋆, KL(P0, Pj⋆) =E 0[Nj⋆]·kl 1 4 , 1 4 + ρ√ 2 ≤ 4ρ2T d−1 ≤ 1 16 . Apply the Bretagnolle–Huber inequality: Pj⋆(Ec) +P 0(E)≥ 1 2 e−KL(P0,Pj⋆). Usingd≥9and the bounds above, Pj⋆(Ec)≥ 1 2 e−1/16 − 2 d−1 ≥ 1 5 . On Ec, the optimal arm is pulled at mostT /2times, so at leastT /2rounds incur gap ρ/ √
-
[8]
Large-shift regime.Now suppose ρ > 1 8 r d−1 T
Therefore Ej⋆[RT ]≥ ρ√ 2 · T 2 ·P j⋆(Ec)≥ ρT 10 √ 2 . Large-shift regime.Now suppose ρ > 1 8 r d−1 T . Set ρ0 := 1 8 r d−1 T . 26 By T≥d 2 and d≥9 , we have ρ0 ≤1/4 , so the same construction is admissible with ρ0 in place of ρ. Every instance allowed under ρ0 is also allowed under ρ, so the lower bound from the small-shift regime applies withρ 0: sup I E...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.