Bernstein-type dimension-free concentration for self-normalised martingales
Pith reviewed 2026-05-22 00:38 UTC · model grok-4.3
The pith
Self-normalised martingales satisfy a dimension-free Bernstein-type tail inequality controlled by information gain.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a dimension-free Bernstein-type tail inequality for self-normalised martingales, where the normalisation uses the predictable quadratic variation and the radius depends on the information gain of the observed covariance.
What carries the argument
The dimension-free Bernstein-type tail inequality for self-normalised martingales, which bounds deviations using information gain instead of ambient dimension.
If this is right
- Ellipsoidal confidence sequences for logistic regression with adaptively chosen Hilbert-valued covariates.
- Instance-adaptive regret bounds for Hilbert-armed logistic bandits.
- Improved tail bounds that hold uniformly over adaptive choices without dimension penalties.
Where Pith is reading between the lines
- Such inequalities could apply to other adaptive algorithms in functional analysis or kernel methods.
- Testing the bound in finite-dimensional approximations of Hilbert spaces would validate its practical utility.
- The approach might extend to other types of self-normalised processes beyond martingales.
Load-bearing premise
The self-normalised martingales must have their tail bounds controlled by the information gain of the observed covariance in a Hilbert space under adaptive covariate selection.
What would settle it
Constructing an adaptive sequence of Hilbert-valued covariates where the self-normalised martingale deviates beyond the information-gain-controlled radius with positive probability.
read the original abstract
We introduce a dimension-free Bernstein-type tail inequality for self-normalised martingales, where the normalisation uses the predictable quadratic variation and the radius depends on the information gain of the observed covariance. As applications, we provide ellipsoidal confidence sequences for logistic regression with adaptively chosen Hilbert-valued covariates, and give instance-adaptive regret bounds for Hilbert-armed logistic bandits.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a dimension-free Bernstein-type tail inequality for self-normalised martingales in Hilbert space. Normalisation uses the predictable quadratic variation, and the radius is controlled by the information gain of the observed covariance under adaptive covariate selection. Applications include ellipsoidal confidence sequences for logistic regression with adaptively chosen Hilbert-valued covariates and instance-adaptive regret bounds for Hilbert-armed logistic bandits.
Significance. If the central inequality holds, the result offers a useful advance for concentration inequalities in infinite-dimensional adaptive settings. The information-gain radius provides an adaptive, dimension-free control that could improve bounds in online learning and bandit problems. The direct derivation via exponential-martingale arguments (rather than reduction to prior fitted quantities) is a strength.
major comments (2)
- [§3, Theorem 3.1] §3, Theorem 3.1: the proof that the adaptive dependence is fully absorbed into the information-gain term without introducing hidden dimension dependence or boundedness assumptions on the covariates is only sketched; a fully expanded argument is needed to confirm the dimension-free claim.
- [§5.1, Eq. (18)] §5.1, Eq. (18): the self-normalised process is asserted to remain a martingale under adaptive Hilbert-valued covariate choice, but the verification step that the compensator remains the predictable quadratic variation is not detailed enough to rule out leakage of adaptivity into the radius.
minor comments (2)
- [§2] Notation for the information-gain functional is introduced late; moving the definition to §2 would improve readability.
- [§6] The logistic-regression application in §6 assumes the link function satisfies standard Lipschitz conditions, but an explicit statement of the constant would help readers replicate the ellipsoidal sequence.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments on the manuscript. We address each major comment below and have prepared a revised version incorporating the requested expansions and clarifications.
read point-by-point responses
-
Referee: [§3, Theorem 3.1] §3, Theorem 3.1: the proof that the adaptive dependence is fully absorbed into the information-gain term without introducing hidden dimension dependence or boundedness assumptions on the covariates is only sketched; a fully expanded argument is needed to confirm the dimension-free claim.
Authors: We agree that the original proof of Theorem 3.1 was presented in a condensed form. In the revised manuscript we have expanded the argument in Section 3 into a complete, self-contained derivation. The new text explicitly verifies that any dependence arising from the adaptive choice of Hilbert-valued covariates is captured entirely by the information-gain term of the observed covariance operator. The steps rely only on the predictability of the quadratic variation and the definition of information gain; no auxiliary boundedness assumptions on the covariates are added, and the argument remains free of hidden dimension dependence by working directly with the trace-class operator norm. revision: yes
-
Referee: [§5.1, Eq. (18)] §5.1, Eq. (18): the self-normalised process is asserted to remain a martingale under adaptive Hilbert-valued covariate choice, but the verification step that the compensator remains the predictable quadratic variation is not detailed enough to rule out leakage of adaptivity into the radius.
Authors: We thank the referee for highlighting this point. The verification in the original Section 5.1 was indeed too brief. The revised manuscript now contains a detailed check immediately following Equation (18). We show that, conditional on the filtration generated by the past covariates and responses, the increment of the self-normalised process has conditional expectation zero and that its quadratic variation process is predictable by construction of the covariate-selection rule. This establishes that the compensator coincides with the predictable quadratic variation and that no adaptivity leaks into the radius term of the resulting concentration bound. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper derives a dimension-free Bernstein-type tail inequality for self-normalised martingales directly from the predictable quadratic variation and a standard exponential-martingale argument that absorbs adaptive dependence into an information-gain term. No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or imported uniqueness theorem; the central claim and its applications (ellipsoidal confidence sequences, instance-adaptive regret bounds) rest on independent probabilistic arguments rather than renaming or re-deriving prior fitted quantities. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Standard properties of predictable quadratic variation for martingales
- domain assumption Information gain of observed covariance is well-defined for adaptively chosen Hilbert-valued covariates
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Theorem 1.1 (Bernstein-type). … ρ⋆_n = inf{ρ≥1: ρ≥γ(ρ⁻¹V_n)} where γ(ρ⁻¹V_n)=½log det(I+ρ⁻¹V_n)
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Proof of Theorem 3.2 … truncation into head (finite predictable dimension Dn≤4γ(ρ⁻¹V_n)) and tail bounded by Hoeffding
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions
SupSplitLog achieves Õ(√(dT)) regret for logistic bandits without context diversity assumptions by splitting samples for an initial estimator and Newton correction, and can adapt to data-dependent bounds.
-
Vector-valued self-normalized concentration inequalities beyond sub-Gaussianity
Derives vector-valued self-normalized concentration bounds for light-tailed processes beyond sub-Gaussianity, with applications to online linear regression and linear bandits.
Reference graph
Works this paper leans on
-
[1]
Cited on page 9. K. Azuma. Weighted sums of certain dependent random variables.Tohoku Mathematical Journal, Second Series, 19(3):357–367, 1967. Cited on page 3. 12 D. Calandriello, A. Lazaric and M. Valko. Second-order kernel online convex optimization with adaptive sketching. InInternational Conference on Machine Learning, 2017. Cited on page 5. S. R. Ch...
work page 1967
-
[2]
Cited on pages 4, 11. S. Filippi, O. Cappe, A. Garivier and C. Szepesvári. Parametric bandits: The generalized linear case.Advances in Neural Information Processing Systems, 23, 2010. Cited on pages 4, 11. H. Flynn and D. Reeb. Tighter confidence bounds for sequential kernel regression.arXiv preprint arXiv:2403.12732, 2024. Cited on pages 5, 9. D. A. Free...
-
[3]
Cited on page 9. T. L. Lai and H. Robbins. Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes.Zeitschrift für Wahrscheinlichkeitstheorie und ver- wandte Gebiete, 56(3):329–360, 1981. Cited on page 3. T. L. Lai and C. Z. Wei. Least squares estimates in stochastic regression models with applica- tions to identificat...
work page 1981
-
[4]
Cited on page 3. T. Lattimore and C. Szepesvári.Bandit algorithms. 2020. Cited on page 3. M. Ledoux and M. Talagrand.Probability in Banach Spaces: Isoperimetry and Processes, volume 23. 1991. Cited on page 3. J. Lee, S.-Y. Yun and K.-S. Jun. A unified confidence sequence for generalized linear models, with applications to bandits.Advances in Neural Inform...
-
[5]
Cited on page 5. S. Vakili, J. Scarlett and T. Javidi. Open problem: Tight online confidence intervals for RKHS elements. InConference on Learning Theory, 2021. Cited on pages 9, 12. M. Valko, N. Korda, R. Munos, I. Flaounas and N. Cristianini. Finite-Time Analysis of Kernelised Contextual Bandits. InUncertainty in Artificial Intelligence, 2013. Cited on ...
-
[6]
Let πn|B be the restriction ofπn to B and let n(πn; 1) = R ∥x∥≤1 πn
Let πn be the isotropic, centred Gaussian measure with varianceρ−1 on H1 ⊕ · · · ⊕Hn. Let πn|B be the restriction ofπn to B and let n(πn; 1) = R ∥x∥≤1 πn
-
[7]
Let µn| 1 2 B be the restriction ofµn to 1 2 B = {x: ∥x∥ ≤ 1 2 } and let n(µn; 1
Let µn be the probability measure with densityµn(dx) = exp{− 1 2 ∥x∥2 Hn}. Let µn| 1 2 B be the restriction ofµn to 1 2 B = {x: ∥x∥ ≤ 1 2 } and let n(µn; 1
-
[8]
= R ∥x∥≤ 1 2 µn(dx). Furthermore, for eachn ∈ N+, let ¯M0 = 1 and ¯Mn := ¯Mn(πn|B) and define Jn(x) = ⟨Sn, x⟩ − ⟨x, Hnx⟩/2 . Then, for anyy ∈ H satisfying ∥y∥ ≤ 1 2, log ¯Mn = Jn(y; ρ) + log n(µn; 1 2) n(πn; 1) , ∀n ∈ N+ . Proof. For eachn ∈ N+, let n(πn; 1) = R ∥x∥≤1 πn and observe that ¯Mn = 1 n(πn; 1) Z ∥x∥≤1 exp{Jn(x)} dx . Let gn = ∇Jn(y) and observe...
-
[9]
· 1 n(µn; 1 2) Z exp{⟨x, g⟩}µn| 1 2 B(dx) (definition of µn| 1 2 B) ≥ n(µn; 1
-
[10]
exp ( 1 n(µn; 1 2) Z ⟨x, g⟩ µn| 1 2 B(dx) ) (Jensen’s inequality) = n(µn; 1
-
[11]
For any n ∈ N+, log n(πn; 1) n(νn; 1
(µn| 1 2 B is zero-mean) Proposition A.4. For any n ∈ N+, log n(πn; 1) n(νn; 1
-
[12]
(9) We now state the proof of Theorem 3.1
≤ 2Dn + γ(ρ−1⟨S⟩n) . (9) We now state the proof of Theorem 3.1. Proposition A.4 is proven in Appendix A.1. 17 Proof of Theorem 3.1.We apply Claim A.3 with y = √ρH −1 n (ρ)Sn 2∥H − 1 2 n (ρ)Sn∥ , noting that ∥y∥ ≤ 1/2, as required, and that Jn(y) = √ρ 2 ∥H − 1 2 n (ρ)Sn∥ − ρ 4 . We thus obtain that, log ¯Mn ≥ √ρ 2 ∥H − 1 2 n (ρ)Sn∥ − ρ 4 + log n(πn; 1) n(νn; 1
-
[13]
Now, leveraging that by Claim A.2,¯Mn is a positive supermartingale, we obtain by Ville’s inequality that P{∃n: log Mn ≥ y} ≤ e−y . Combining this with our lower bound onlog Mn and upper bound on the log-ratio of normalising constants from Proposition A.4, we obtain the desired result. A.1 Proof of Proposition A.4, log-ratio bound Throughout, letx denote ...
-
[14]
and applying the change of variablesei 7→ H − 1 2 n ei, we have n(ν; 1
-
[15]
= Z ∥x∥≤ 1 2 exp{−∥x∥2 Hn/2} dx (12) = 1p det H ′n Z ∥x∥H−1n ≤ 1 2 exp{−∥x∥2/2} dx (13) ≥ 1p det H ′n Z ∥x∥≤ √ρ 2 exp{−∥x∥2/2} dx (ρΠn ⪯ H ′ n) = 1p det H ′n In(0, 1
-
[16]
(14) Combining (11) and (14), we have that log n(πn) n(νn) ≤ log In(0, 1) In(0, 1
-
[17]
+ 1 2 log(ρ−Dn det H ′ n) . (15) 18 Since ρ−Dn det Hn = 3Dn det(ρ−1⟨S⟩n + I), the second term is bounded by 1 2 log(ρ−Dn det H ′ n) ≤ Dn 2 log 3 +γ(ρ−1⟨S⟩n) . We turn to bounding the first term. LetVn(a) denote the volume of the ball inHn with radius a√ρ. Then, exp{−√ρ/8}Vn( 1
-
[18]
Consequently, In(0, 1) In(0, 1
, I( 1 2 , 1) ≤ exp{−√ρ/8}(Vn(1) − Vn( 1 2)) . Consequently, In(0, 1) In(0, 1
-
[19]
+ In( 1 2 , 1) In(0, 1
-
[20]
= 1 + In( 1 2 , 1) In(0, 1
-
[21]
(16) Since Hn is a Dn-dimensional Hilbert space, there exists a constantC > 0 depending on Dn only, such thatVn(a) = C(a√ρ)Dn. Thus, Vn(1)/Vn( 1
-
[22]
= 2Dn. Substituting this into Eq. (15) and boundinglog 2 + 1 2 log 3 ≤ 2 concludes the proof. B Proof of Theorem 4.1, confidence sequence for logistic regression Observe that the logistic link function satisfies|¨µ(u)| ≤ ˙µ(u) for all u ∈ R, making it what Sun and Tran-Dinh (2019) would call a(2, 1)-generalised self-concordant function. The map f 7→ L ρ n...
work page 2019
-
[23]
For υ(t) = et−1 t , υ(−t)∇2f(x) ⪯ Z 1 0 ∇2f(x + τ(y − x)) dτ ⪯ υ(t)∇2f(x) , where A ⪯ B if B − A is positive
-
[24]
Our proof will also use the following two simple observations
For Υ(t) = et−t−1 t2 , Υ(−t)∥y − x∥2 ∇2f (x) ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ Υ(t)∥y − x∥2 ∇2f (x) . Our proof will also use the following two simple observations. Lemma B.2. If x2 ≤ xb + c for b, c > 0, then x ≤ b + √c. Lemma B.3. Define the ‘gradient functional’gn on H by gn(f) = nX i=1 µ(⟨f, Xi⟩)Xi + ρf , 2The results of Sun and Tran-Dinh (2019) are wr...
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.