pith. sign in

arxiv: 2507.20982 · v2 · pith:RMUCJXOZnew · submitted 2025-07-28 · 🧮 math.PR · math.ST· stat.TH

Bernstein-type dimension-free concentration for self-normalised martingales

Pith reviewed 2026-05-22 00:38 UTC · model grok-4.3

classification 🧮 math.PR math.STstat.TH
keywords concentration inequalityself-normalised martingaleBernstein inequalitydimension-free boundlogistic regressionHilbert spacebanditsinformation gain
0
0 comments X

The pith

Self-normalised martingales satisfy a dimension-free Bernstein-type tail inequality controlled by information gain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a concentration inequality for self-normalised martingales that remains valid without depending on the dimension of the space. A sympathetic reader would care because it enables precise uncertainty quantification in settings where covariates live in infinite-dimensional Hilbert spaces and are selected adaptively during the process. The inequality normalises using the predictable quadratic variation while setting the radius via the information gain of the observed covariance. This leads directly to new confidence sequences and regret bounds in adaptive statistical problems.

Core claim

We introduce a dimension-free Bernstein-type tail inequality for self-normalised martingales, where the normalisation uses the predictable quadratic variation and the radius depends on the information gain of the observed covariance.

What carries the argument

The dimension-free Bernstein-type tail inequality for self-normalised martingales, which bounds deviations using information gain instead of ambient dimension.

If this is right

  • Ellipsoidal confidence sequences for logistic regression with adaptively chosen Hilbert-valued covariates.
  • Instance-adaptive regret bounds for Hilbert-armed logistic bandits.
  • Improved tail bounds that hold uniformly over adaptive choices without dimension penalties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such inequalities could apply to other adaptive algorithms in functional analysis or kernel methods.
  • Testing the bound in finite-dimensional approximations of Hilbert spaces would validate its practical utility.
  • The approach might extend to other types of self-normalised processes beyond martingales.

Load-bearing premise

The self-normalised martingales must have their tail bounds controlled by the information gain of the observed covariance in a Hilbert space under adaptive covariate selection.

What would settle it

Constructing an adaptive sequence of Hilbert-valued covariates where the self-normalised martingale deviates beyond the information-gain-controlled radius with positive probability.

read the original abstract

We introduce a dimension-free Bernstein-type tail inequality for self-normalised martingales, where the normalisation uses the predictable quadratic variation and the radius depends on the information gain of the observed covariance. As applications, we provide ellipsoidal confidence sequences for logistic regression with adaptively chosen Hilbert-valued covariates, and give instance-adaptive regret bounds for Hilbert-armed logistic bandits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a dimension-free Bernstein-type tail inequality for self-normalised martingales in Hilbert space. Normalisation uses the predictable quadratic variation, and the radius is controlled by the information gain of the observed covariance under adaptive covariate selection. Applications include ellipsoidal confidence sequences for logistic regression with adaptively chosen Hilbert-valued covariates and instance-adaptive regret bounds for Hilbert-armed logistic bandits.

Significance. If the central inequality holds, the result offers a useful advance for concentration inequalities in infinite-dimensional adaptive settings. The information-gain radius provides an adaptive, dimension-free control that could improve bounds in online learning and bandit problems. The direct derivation via exponential-martingale arguments (rather than reduction to prior fitted quantities) is a strength.

major comments (2)
  1. [§3, Theorem 3.1] §3, Theorem 3.1: the proof that the adaptive dependence is fully absorbed into the information-gain term without introducing hidden dimension dependence or boundedness assumptions on the covariates is only sketched; a fully expanded argument is needed to confirm the dimension-free claim.
  2. [§5.1, Eq. (18)] §5.1, Eq. (18): the self-normalised process is asserted to remain a martingale under adaptive Hilbert-valued covariate choice, but the verification step that the compensator remains the predictable quadratic variation is not detailed enough to rule out leakage of adaptivity into the radius.
minor comments (2)
  1. [§2] Notation for the information-gain functional is introduced late; moving the definition to §2 would improve readability.
  2. [§6] The logistic-regression application in §6 assumes the link function satisfies standard Lipschitz conditions, but an explicit statement of the constant would help readers replicate the ellipsoidal sequence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on the manuscript. We address each major comment below and have prepared a revised version incorporating the requested expansions and clarifications.

read point-by-point responses
  1. Referee: [§3, Theorem 3.1] §3, Theorem 3.1: the proof that the adaptive dependence is fully absorbed into the information-gain term without introducing hidden dimension dependence or boundedness assumptions on the covariates is only sketched; a fully expanded argument is needed to confirm the dimension-free claim.

    Authors: We agree that the original proof of Theorem 3.1 was presented in a condensed form. In the revised manuscript we have expanded the argument in Section 3 into a complete, self-contained derivation. The new text explicitly verifies that any dependence arising from the adaptive choice of Hilbert-valued covariates is captured entirely by the information-gain term of the observed covariance operator. The steps rely only on the predictability of the quadratic variation and the definition of information gain; no auxiliary boundedness assumptions on the covariates are added, and the argument remains free of hidden dimension dependence by working directly with the trace-class operator norm. revision: yes

  2. Referee: [§5.1, Eq. (18)] §5.1, Eq. (18): the self-normalised process is asserted to remain a martingale under adaptive Hilbert-valued covariate choice, but the verification step that the compensator remains the predictable quadratic variation is not detailed enough to rule out leakage of adaptivity into the radius.

    Authors: We thank the referee for highlighting this point. The verification in the original Section 5.1 was indeed too brief. The revised manuscript now contains a detailed check immediately following Equation (18). We show that, conditional on the filtration generated by the past covariates and responses, the increment of the self-normalised process has conditional expectation zero and that its quadratic variation process is predictable by construction of the covariate-selection rule. This establishes that the compensator coincides with the predictable quadratic variation and that no adaptivity leaks into the radius term of the resulting concentration bound. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper derives a dimension-free Bernstein-type tail inequality for self-normalised martingales directly from the predictable quadratic variation and a standard exponential-martingale argument that absorbs adaptive dependence into an information-gain term. No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or imported uniqueness theorem; the central claim and its applications (ellipsoidal confidence sequences, instance-adaptive regret bounds) rest on independent probabilistic arguments rather than renaming or re-deriving prior fitted quantities. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result rests on standard properties of martingales and quadratic variation together with domain assumptions about information gain in Hilbert spaces; no free parameters or new invented entities are indicated in the abstract.

axioms (2)
  • standard math Standard properties of predictable quadratic variation for martingales
    Used to define the self-normalisation in the tail inequality.
  • domain assumption Information gain of observed covariance is well-defined for adaptively chosen Hilbert-valued covariates
    Controls the radius in the stated applications to logistic regression and bandits.

pith-pipeline@v0.9.0 · 5582 in / 1456 out tokens · 73304 ms · 2026-05-22T00:38:36.494222+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions

    cs.LG 2026-04 unverdicted novelty 8.0

    SupSplitLog achieves Õ(√(dT)) regret for logistic bandits without context diversity assumptions by splitting samples for an initial estimator and Newton correction, and can adapt to data-dependent bounds.

  2. Vector-valued self-normalized concentration inequalities beyond sub-Gaussianity

    stat.ML 2025-11 unverdicted novelty 7.0

    Derives vector-valued self-normalized concentration bounds for light-tailed processes beyond sub-Gaussianity, with applications to online linear regression and linear bandits.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 2 Pith papers

  1. [1]

    Cited on page 9. K. Azuma. Weighted sums of certain dependent random variables.Tohoku Mathematical Journal, Second Series, 19(3):357–367, 1967. Cited on page 3. 12 D. Calandriello, A. Lazaric and M. Valko. Second-order kernel online convex optimization with adaptive sketching. InInternational Conference on Machine Learning, 2017. Cited on page 5. S. R. Ch...

  2. [2]

    Cited on pages 4, 11. S. Filippi, O. Cappe, A. Garivier and C. Szepesvári. Parametric bandits: The generalized linear case.Advances in Neural Information Processing Systems, 23, 2010. Cited on pages 4, 11. H. Flynn and D. Reeb. Tighter confidence bounds for sequential kernel regression.arXiv preprint arXiv:2403.12732, 2024. Cited on pages 5, 9. D. A. Free...

  3. [3]

    Cited on page 9. T. L. Lai and H. Robbins. Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes.Zeitschrift für Wahrscheinlichkeitstheorie und ver- wandte Gebiete, 56(3):329–360, 1981. Cited on page 3. T. L. Lai and C. Z. Wei. Least squares estimates in stochastic regression models with applica- tions to identificat...

  4. [4]

    Cited on page 3. T. Lattimore and C. Szepesvári.Bandit algorithms. 2020. Cited on page 3. M. Ledoux and M. Talagrand.Probability in Banach Spaces: Isoperimetry and Processes, volume 23. 1991. Cited on page 3. J. Lee, S.-Y. Yun and K.-S. Jun. A unified confidence sequence for generalized linear models, with applications to bandits.Advances in Neural Inform...

  5. [5]

    Cited on page 5. S. Vakili, J. Scarlett and T. Javidi. Open problem: Tight online confidence intervals for RKHS elements. InConference on Learning Theory, 2021. Cited on pages 9, 12. M. Valko, N. Korda, R. Munos, I. Flaounas and N. Cristianini. Finite-Time Analysis of Kernelised Contextual Bandits. InUncertainty in Artificial Intelligence, 2013. Cited on ...

  6. [6]

    Let πn|B be the restriction ofπn to B and let n(πn; 1) = R ∥x∥≤1 πn

    Let πn be the isotropic, centred Gaussian measure with varianceρ−1 on H1 ⊕ · · · ⊕Hn. Let πn|B be the restriction ofπn to B and let n(πn; 1) = R ∥x∥≤1 πn

  7. [7]

    Let µn| 1 2 B be the restriction ofµn to 1 2 B = {x: ∥x∥ ≤ 1 2 } and let n(µn; 1

    Let µn be the probability measure with densityµn(dx) = exp{− 1 2 ∥x∥2 Hn}. Let µn| 1 2 B be the restriction ofµn to 1 2 B = {x: ∥x∥ ≤ 1 2 } and let n(µn; 1

  8. [8]

    Furthermore, for eachn ∈ N+, let ¯M0 = 1 and ¯Mn := ¯Mn(πn|B) and define Jn(x) = ⟨Sn, x⟩ − ⟨x, Hnx⟩/2

    = R ∥x∥≤ 1 2 µn(dx). Furthermore, for eachn ∈ N+, let ¯M0 = 1 and ¯Mn := ¯Mn(πn|B) and define Jn(x) = ⟨Sn, x⟩ − ⟨x, Hnx⟩/2 . Then, for anyy ∈ H satisfying ∥y∥ ≤ 1 2, log ¯Mn = Jn(y; ρ) + log n(µn; 1 2) n(πn; 1) , ∀n ∈ N+ . Proof. For eachn ∈ N+, let n(πn; 1) = R ∥x∥≤1 πn and observe that ¯Mn = 1 n(πn; 1) Z ∥x∥≤1 exp{Jn(x)} dx . Let gn = ∇Jn(y) and observe...

  9. [9]

    · 1 n(µn; 1 2) Z exp{⟨x, g⟩}µn| 1 2 B(dx) (definition of µn| 1 2 B) ≥ n(µn; 1

  10. [10]

    exp ( 1 n(µn; 1 2) Z ⟨x, g⟩ µn| 1 2 B(dx) ) (Jensen’s inequality) = n(µn; 1

  11. [11]

    For any n ∈ N+, log n(πn; 1) n(νn; 1

    (µn| 1 2 B is zero-mean) Proposition A.4. For any n ∈ N+, log n(πn; 1) n(νn; 1

  12. [12]

    (9) We now state the proof of Theorem 3.1

    ≤ 2Dn + γ(ρ−1⟨S⟩n) . (9) We now state the proof of Theorem 3.1. Proposition A.4 is proven in Appendix A.1. 17 Proof of Theorem 3.1.We apply Claim A.3 with y = √ρH −1 n (ρ)Sn 2∥H − 1 2 n (ρ)Sn∥ , noting that ∥y∥ ≤ 1/2, as required, and that Jn(y) = √ρ 2 ∥H − 1 2 n (ρ)Sn∥ − ρ 4 . We thus obtain that, log ¯Mn ≥ √ρ 2 ∥H − 1 2 n (ρ)Sn∥ − ρ 4 + log n(πn; 1) n(νn; 1

  13. [13]

    Combining this with our lower bound onlog Mn and upper bound on the log-ratio of normalising constants from Proposition A.4, we obtain the desired result

    Now, leveraging that by Claim A.2,¯Mn is a positive supermartingale, we obtain by Ville’s inequality that P{∃n: log Mn ≥ y} ≤ e−y . Combining this with our lower bound onlog Mn and upper bound on the log-ratio of normalising constants from Proposition A.4, we obtain the desired result. A.1 Proof of Proposition A.4, log-ratio bound Throughout, letx denote ...

  14. [14]

    and applying the change of variablesei 7→ H − 1 2 n ei, we have n(ν; 1

  15. [15]

    = Z ∥x∥≤ 1 2 exp{−∥x∥2 Hn/2} dx (12) = 1p det H ′n Z ∥x∥H−1n ≤ 1 2 exp{−∥x∥2/2} dx (13) ≥ 1p det H ′n Z ∥x∥≤ √ρ 2 exp{−∥x∥2/2} dx (ρΠn ⪯ H ′ n) = 1p det H ′n In(0, 1

  16. [16]

    (14) Combining (11) and (14), we have that log n(πn) n(νn) ≤ log In(0, 1) In(0, 1

  17. [17]

    (15) 18 Since ρ−Dn det Hn = 3Dn det(ρ−1⟨S⟩n + I), the second term is bounded by 1 2 log(ρ−Dn det H ′ n) ≤ Dn 2 log 3 +γ(ρ−1⟨S⟩n)

    + 1 2 log(ρ−Dn det H ′ n) . (15) 18 Since ρ−Dn det Hn = 3Dn det(ρ−1⟨S⟩n + I), the second term is bounded by 1 2 log(ρ−Dn det H ′ n) ≤ Dn 2 log 3 +γ(ρ−1⟨S⟩n) . We turn to bounding the first term. LetVn(a) denote the volume of the ball inHn with radius a√ρ. Then, exp{−√ρ/8}Vn( 1

  18. [18]

    Consequently, In(0, 1) In(0, 1

    , I( 1 2 , 1) ≤ exp{−√ρ/8}(Vn(1) − Vn( 1 2)) . Consequently, In(0, 1) In(0, 1

  19. [19]

    + In( 1 2 , 1) In(0, 1

  20. [20]

    = 1 + In( 1 2 , 1) In(0, 1

  21. [21]

    Thus, Vn(1)/Vn( 1

    (16) Since Hn is a Dn-dimensional Hilbert space, there exists a constantC > 0 depending on Dn only, such thatVn(a) = C(a√ρ)Dn. Thus, Vn(1)/Vn( 1

  22. [22]

    Substituting this into Eq

    = 2Dn. Substituting this into Eq. (15) and boundinglog 2 + 1 2 log 3 ≤ 2 concludes the proof. B Proof of Theorem 4.1, confidence sequence for logistic regression Observe that the logistic link function satisfies|¨µ(u)| ≤ ˙µ(u) for all u ∈ R, making it what Sun and Tran-Dinh (2019) would call a(2, 1)-generalised self-concordant function. The map f 7→ L ρ n...

  23. [23]

    For υ(t) = et−1 t , υ(−t)∇2f(x) ⪯ Z 1 0 ∇2f(x + τ(y − x)) dτ ⪯ υ(t)∇2f(x) , where A ⪯ B if B − A is positive

  24. [24]

    Our proof will also use the following two simple observations

    For Υ(t) = et−t−1 t2 , Υ(−t)∥y − x∥2 ∇2f (x) ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ Υ(t)∥y − x∥2 ∇2f (x) . Our proof will also use the following two simple observations. Lemma B.2. If x2 ≤ xb + c for b, c > 0, then x ≤ b + √c. Lemma B.3. Define the ‘gradient functional’gn on H by gn(f) = nX i=1 µ(⟨f, Xi⟩)Xi + ρf , 2The results of Sun and Tran-Dinh (2019) are wr...