Direction-Aware Offline-to-Online Learning in Linear Contextual Bandits

Jiheng Zhang; Ruihan Lin; Zean Han; Zezhen Ding

arxiv: 2604.24016 · v2 · submitted 2026-04-27 · 💻 cs.LG

Direction-Aware Offline-to-Online Learning in Linear Contextual Bandits

Zean Han , Ruihan Lin , Zezhen Ding , Jiheng Zhang This is my paper

Pith reviewed 2026-05-08 04:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords linear contextual banditsoffline-to-online learningdirectional bias certificateregret boundsadaptive estimationellipsoidal confidence setshistorical data exploitationdirectional bias

0 comments

The pith

A directional bias certificate lets linear bandits safely exploit biased offline data without dimension-dependent regret.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that bias between offline historical data and the online bandit environment is directional in linear contextual bandits: offline logs can be reliable along some feature directions yet misleading along others. Standard Euclidean bounds on the parameter gap treat all directions equally and force unnecessary exploration costs even when the offline parameter is known exactly. By introducing a certificate that measures the gap via an M_bias-induced norm and allocates separate bias budgets per direction, the authors build an algorithm that pools offline data only where it is trustworthy. This approach matches the usual worst-case regret rate while delivering strictly lower regret whenever offline coverage aligns with low-bias directions, and it extends to adaptive estimation of the certificate from data.

Core claim

We introduce the directional bias certificate (M_bias, ρ) that quantifies the offline-to-online gap through an M_bias-induced norm and assigns different bias budgets to different directions. Building on this certificate we propose Ellipsoidal-MINUCB, which augments the online learner with an offline-pooled branch that safely exploits historical data. When the certificate is known the algorithm matches the SupLinUCB regret rate in the worst case and improves whenever offline coverage aligns with low-bias directions. When the certificate is unknown we estimate it adaptively from offline and accumulated online data and prove a corresponding regret guarantee.

What carries the argument

The directional bias certificate (M_bias, ρ) that measures the offline-to-online gap via an M_bias-induced norm and allocates separate bias budgets to different feature directions.

If this is right

The algorithm matches the standard SupLinUCB regret rate of order d sqrt(T) in the worst case when the certificate is known.
Regret strictly improves over the standard rate whenever offline coverage aligns with low-bias directions.
Adaptive estimation of the unknown certificate from offline and online data yields a corresponding regret bound.
Numerical experiments confirm practical gains precisely in the aligned regimes predicted by the theory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same directional certificate idea could be tested in non-linear or kernelized bandits by defining an analogous norm on the reproducing kernel Hilbert space.
In practice, one could pre-compute candidate M_bias matrices from historical logs to decide which directions are safe to exploit before deployment.
The geometry of the feature space becomes a first-class object for transfer between offline and online regimes, suggesting new diagnostics that inspect per-direction coverage before running the bandit.

Load-bearing premise

The offline-to-online gap is directional and can be captured by an M_bias-induced norm certificate that permits safe exploitation without forcing dimension-dependent regret even when the certificate must be estimated from data.

What would settle it

Construct a linear bandit instance with known certificate where bias exists in only one unknown direction; if the algorithm still incurs regret scaling with the full dimension d rather than improving relative to standard SupLinUCB, the improvement claim is falsified.

Figures

Figures reproduced from arXiv: 2604.24016 by Jiheng Zhang, Ruihan Lin, Zean Han, Zezhen Ding.

**Figure 1.** Figure 1: Main synthetic study: mean cumulative regret with variability bands across seeds. The view at source ↗

**Figure 2.** Figure 2: Randomized shift sweep: for each c, mean cumulative regret under the (s, u) law above (left to right, increasing c; five panels in the published figure). Fixed online parameter, varying offline mean on one coordinate. We fix θ∗ = (2, 1, 1, 1, 1) and change only θ † 1 among {0, 1, 2, 1, 4} (with the remaining coordinates 1 in each case, so the middle and fourth panels repeat the same θ † used as a check on … view at source ↗

**Figure 3.** Figure 3: Ablation: five values of θ † 1 with θ∗ as above, same hyperparameters otherwise. 32 view at source ↗

**Figure 4.** Figure 4: Offline-coverage study: each panel is a value of view at source ↗

read the original abstract

Many bandit systems are deployed with offline historical data, such as past logs from earlier policies. Using these data can reduce early online exploration when they remain informative for the online problem. When the offline and online environments differ, such data can be biased for the online problem. For linear (contextual) bandits, this bias is directional: offline data may be informative in some feature directions and misleading in others. However, prior work typically controls this gap through a known Euclidean bound on the model parameters, which we prove is too coarse: even with the offline parameter known, bias in a single unknown direction can force dimension-dependent regret. To address this challenge, we introduce a directional bias certificate $(M_{\mathrm{bias}},\rho)$ that measures the offline-to-online gap through an $M_{\mathrm{bias}}$-induced norm and assigns different bias budgets to different directions. Building on this certificate, we propose \emph{Ellipsoidal-MINUCB}, which augments the online learning with an offline-pooled branch that safely exploits historical data. When the certificate is known, we show that the algorithm matches the standard SupLinUCB rate in the worst case and improves when offline coverage aligns with low-bias directions. When the certificate is unknown, we estimate it adaptively from offline and accumulated online data and establish a corresponding regret guarantee. Numerical experiments support the theory and show gains in aligned regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The directional bias certificate refines offline-to-online bounds in linear contextual bandits by avoiding crude Euclidean penalties, but the adaptive estimation step risks reintroducing dimension dependence if error terms are not controlled tightly.

read the letter

This paper's key move is replacing the usual uniform bound on offline bias with a directional certificate that uses an M_bias-induced norm to give different budgets to different feature directions. That lets the algorithm exploit offline data where coverage is good without paying full dimension-dependent regret for bias in one bad direction, which the abstract shows can happen even with known offline parameters under Euclidean controls. Ellipsoidal-MINUCB adds an offline-pooled branch on top of standard online steps, and the known-certificate case matches SupLinUCB worst-case while improving when alignment holds. The adaptive version estimates the certificate from data and claims a corresponding regret guarantee, with experiments backing the gains in aligned regimes. The idea is new in this form and directly tackles a limitation in prior uniform-bound approaches. The soft spot sits in the unknown-certificate regime. If the high-probability error on the estimated M_bias is not bounded in the precise dual norm that shapes the exploitation ellipsoid, a single poorly covered direction can break the safe-exploitation condition and push regret back to the dimension-dependent case the authors criticize. The stress-test note flags exactly this, and without the full derivations it is unclear whether that term is folded into the final bound or left implicit. The paper is for people working on regret bounds and safe use of historical data in linear bandits. A reader focused on tighter directional analysis would get concrete value from the certificate construction and the algorithm. It deserves a serious referee because the core claim is falsifiable and the setup is standard enough to check. Send it to review, but ask the referees to verify how estimation error on the certificate propagates into the regret expression.

Referee Report

2 major / 2 minor

Summary. The paper introduces a directional bias certificate (M_bias, ρ) that quantifies the offline-to-online gap in linear contextual bandits via an M_bias-induced norm, rather than a coarse Euclidean bound. It proposes the Ellipsoidal-MINUCB algorithm, which augments standard online learning with an offline-pooled branch for safe exploitation. When the certificate is known, the algorithm is shown to match the SupLinUCB regret rate in the worst case while improving in aligned regimes. When unknown, the certificate is estimated adaptively from offline and online data, with a corresponding regret guarantee claimed. Experiments illustrate gains in aligned settings.

Significance. If the regret bounds hold, particularly the adaptive estimation case, the work meaningfully advances offline-to-online bandit learning by replacing dimension-dependent worst-case analysis with a directional certificate that permits safe exploitation without forcing d-linear regret. The explicit matching to SupLinUCB and the adaptive estimation procedure are strengths; the directional norm idea addresses a genuine limitation of prior Euclidean-gap approaches.

major comments (2)

Unknown-certificate regime (regret analysis section): the claimed regret bound must explicitly fold in a high-probability operator-norm or dual-norm bound on the estimation error of M_bias and ρ. If the ellipsoid construction only controls error in Frobenius norm, a single low-coverage direction can violate the safe-exploitation condition and reintroduce dimension dependence, exactly as the Euclidean case the authors criticize. The current sketch does not appear to close this gap.
Theorem for known certificate (main regret theorem): the proof that the algorithm matches SupLinUCB in the worst case while improving under alignment relies on the certificate being exactly known; any looseness in how the M_bias-induced norm is used to shape the offline confidence ellipsoid must be quantified, otherwise the 'parameter-free' improvement claim reduces to a standard bound.

minor comments (2)

Notation for the directional norm and the definition of the offline-pooled branch should be introduced with an explicit example (e.g., a 2-d case) to clarify how different bias budgets are assigned per direction.
The experimental section would benefit from reporting the estimated M_bias matrices and the alignment metric used to generate the 'aligned' vs. 'misaligned' regimes, so readers can reproduce the directional effect.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major comments point-by-point below. Both points identify areas where the current analysis sketch can be strengthened, and we will revise the manuscript accordingly.

read point-by-point responses

Referee: Unknown-certificate regime (regret analysis section): the claimed regret bound must explicitly fold in a high-probability operator-norm or dual-norm bound on the estimation error of M_bias and ρ. If the ellipsoid construction only controls error in Frobenius norm, a single low-coverage direction can violate the safe-exploitation condition and reintroduce dimension dependence, exactly as the Euclidean case the authors criticize. The current sketch does not appear to close this gap.

Authors: We agree that the estimation error must be controlled in the operator (or dual) norm to preserve directional safety. In the revision we will augment the unknown-certificate analysis with explicit high-probability bounds on ||M̂_bias − M_bias||_op and on the error in ρ, obtained via matrix concentration on the pooled offline-plus-online covariance. These bounds will be folded directly into the regret statement so that the safe-exploitation condition holds uniformly and dimension dependence is avoided. revision: yes
Referee: Theorem for known certificate (main regret theorem): the proof that the algorithm matches SupLinUCB in the worst case while improving under alignment relies on the certificate being exactly known; any looseness in how the M_bias-induced norm is used to shape the offline confidence ellipsoid must be quantified, otherwise the 'parameter-free' improvement claim reduces to a standard bound.

Authors: We thank the referee for this observation. While the theorem assumes exact knowledge of the certificate, the current proof sketch does not fully quantify the geometric effect of the M_bias-norm on the offline ellipsoid. In the revision we will insert a supporting lemma that bounds the difference between the M_bias-induced ellipsoid and its Euclidean counterpart, showing that the per-direction exploration cost improves by a factor depending only on ρ and the alignment of offline data, without hidden looseness or additional d-factors. The worst-case matching to SupLinUCB will be stated explicitly. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; regret analysis is self-contained relative to the new certificate.

full rationale

The paper introduces the directional bias certificate (M_bias, ρ) as a modeling device to capture offline-to-online gaps more finely than Euclidean bounds, proves that Euclidean bounds are coarse, and then derives regret guarantees for Ellipsoidal-MINUCB conditionally on the certificate (when known) or via its adaptive estimation (when unknown). These steps do not reduce the claimed bounds to tautologies by construction, nor do they rely on load-bearing self-citations, fitted inputs renamed as predictions, or ansatzes smuggled via prior work. The derivation augments standard SupLinUCB-style analysis with the certificate without the final regret expression collapsing to a redefinition of the certificate itself. This is the normal case of a paper whose central claims rest on independent theoretical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on standard linear contextual bandit assumptions plus the newly introduced directional certificate; no free parameters are explicitly fitted in the abstract description.

axioms (1)

domain assumption Linear contextual bandit model with feature vectors and unknown parameter
Standard assumption in the field that the paper builds upon for regret analysis.

invented entities (1)

Directional bias certificate (M_bias, ρ) no independent evidence
purpose: Measures offline-to-online gap via M_bias-induced norm and assigns direction-specific bias budgets
Newly introduced to address directional bias; no independent evidence provided beyond the paper's claims.

pith-pipeline@v0.9.0 · 5556 in / 1355 out tokens · 31158 ms · 2026-05-08T04:37:27.719385+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1]

the optimal action survives elimination at every layer

work page
[2]

every arm surviving a narrow layer is near-optimal

work page
[3]

stopping at a nonterminal layer certifies a large online width

work page
[4]

In our wrapper these are supplied, respectively, by Lemmas 8, 9, 11, and 12

the layerwise online widths satisfy an elliptical-potential bound. In our wrapper these are supplied, respectively, by Lemmas 8, 9, 11, and 12. The proof of Theorem 1 in Chu et al. [2011] is purely pathwise once those four ingredients hold, so the same dyadic summation yieldsR T ≤C SL p T d L3 T onE. H Proof of Theorem 1 We now prove the pooled part of th...

work page 2011
[5]

Let Nj := TX t=1 1{at =j} be the number of pulls of armj

Let Pj denote the law of the interaction under instance j, and let P0 denote the null instance with parameterθ †, under which all arms have mean1/4. Let Nj := TX t=1 1{at =j} be the number of pulls of armj. Choosej ⋆ minimizingE 0[Nj]. SincePd−1 j=1 Nj =T, we have E0[Nj⋆]≤ T d−1 . Small-shift regime.Assume first that ρ≤ 1 8 r d−1 T . SinceT≥d 2 andd≥9, th...

work page
[6]

Using the standard Bernoulli bound kl(p, q)≤(q−p) 2/(p(1−q)), kl 1 4 , 1 4 + ρ√ 2 ≤ (ρ/ √ 2)2 (1/4)(3/4−ρ/ √

work page
[7]

Since only pulls of armj ⋆ distinguishP 0 fromP j⋆, KL(P0, Pj⋆) =E 0[Nj⋆]·kl 1 4 , 1 4 + ρ√ 2 ≤ 4ρ2T d−1 ≤ 1 16

≤4ρ 2. Since only pulls of armj ⋆ distinguishP 0 fromP j⋆, KL(P0, Pj⋆) =E 0[Nj⋆]·kl 1 4 , 1 4 + ρ√ 2 ≤ 4ρ2T d−1 ≤ 1 16 . Apply the Bretagnolle–Huber inequality: Pj⋆(Ec) +P 0(E)≥ 1 2 e−KL(P0,Pj⋆). Usingd≥9and the bounds above, Pj⋆(Ec)≥ 1 2 e−1/16 − 2 d−1 ≥ 1 5 . On Ec, the optimal arm is pulled at mostT /2times, so at leastT /2rounds incur gap ρ/ √

work page
[8]

Large-shift regime.Now suppose ρ > 1 8 r d−1 T

Therefore Ej⋆[RT ]≥ ρ√ 2 · T 2 ·P j⋆(Ec)≥ ρT 10 √ 2 . Large-shift regime.Now suppose ρ > 1 8 r d−1 T . Set ρ0 := 1 8 r d−1 T . 26 By T≥d 2 and d≥9 , we have ρ0 ≤1/4 , so the same construction is admissible with ρ0 in place of ρ. Every instance allowed under ρ0 is also allowed under ρ, so the lower bound from the small-shift regime applies withρ 0: sup I E...

work page 2024

[1] [1]

the optimal action survives elimination at every layer

work page

[2] [2]

every arm surviving a narrow layer is near-optimal

work page

[3] [3]

stopping at a nonterminal layer certifies a large online width

work page

[4] [4]

In our wrapper these are supplied, respectively, by Lemmas 8, 9, 11, and 12

the layerwise online widths satisfy an elliptical-potential bound. In our wrapper these are supplied, respectively, by Lemmas 8, 9, 11, and 12. The proof of Theorem 1 in Chu et al. [2011] is purely pathwise once those four ingredients hold, so the same dyadic summation yieldsR T ≤C SL p T d L3 T onE. H Proof of Theorem 1 We now prove the pooled part of th...

work page 2011

[5] [5]

Let Nj := TX t=1 1{at =j} be the number of pulls of armj

Let Pj denote the law of the interaction under instance j, and let P0 denote the null instance with parameterθ †, under which all arms have mean1/4. Let Nj := TX t=1 1{at =j} be the number of pulls of armj. Choosej ⋆ minimizingE 0[Nj]. SincePd−1 j=1 Nj =T, we have E0[Nj⋆]≤ T d−1 . Small-shift regime.Assume first that ρ≤ 1 8 r d−1 T . SinceT≥d 2 andd≥9, th...

work page

[6] [6]

Using the standard Bernoulli bound kl(p, q)≤(q−p) 2/(p(1−q)), kl 1 4 , 1 4 + ρ√ 2 ≤ (ρ/ √ 2)2 (1/4)(3/4−ρ/ √

work page

[7] [7]

Since only pulls of armj ⋆ distinguishP 0 fromP j⋆, KL(P0, Pj⋆) =E 0[Nj⋆]·kl 1 4 , 1 4 + ρ√ 2 ≤ 4ρ2T d−1 ≤ 1 16

≤4ρ 2. Since only pulls of armj ⋆ distinguishP 0 fromP j⋆, KL(P0, Pj⋆) =E 0[Nj⋆]·kl 1 4 , 1 4 + ρ√ 2 ≤ 4ρ2T d−1 ≤ 1 16 . Apply the Bretagnolle–Huber inequality: Pj⋆(Ec) +P 0(E)≥ 1 2 e−KL(P0,Pj⋆). Usingd≥9and the bounds above, Pj⋆(Ec)≥ 1 2 e−1/16 − 2 d−1 ≥ 1 5 . On Ec, the optimal arm is pulled at mostT /2times, so at leastT /2rounds incur gap ρ/ √

work page

[8] [8]

Large-shift regime.Now suppose ρ > 1 8 r d−1 T

Therefore Ej⋆[RT ]≥ ρ√ 2 · T 2 ·P j⋆(Ec)≥ ρT 10 √ 2 . Large-shift regime.Now suppose ρ > 1 8 r d−1 T . Set ρ0 := 1 8 r d−1 T . 26 By T≥d 2 and d≥9 , we have ρ0 ≤1/4 , so the same construction is admissible with ρ0 in place of ρ. Every instance allowed under ρ0 is also allowed under ρ, so the lower bound from the small-shift regime applies withρ 0: sup I E...

work page 2024