Provable Anytime Ensemble Sampling Algorithms in Nonlinear Contextual Bandits

Jiazheng Sun; Pan Xu; Weixin Wang

arxiv: 2510.10730 · v2 · submitted 2025-10-12 · 💻 cs.LG · cs.AI· stat.ML

Provable Anytime Ensemble Sampling Algorithms in Nonlinear Contextual Bandits

Jiazheng Sun , Weixin Wang , Pan Xu This is my paper

Pith reviewed 2026-05-18 07:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML

keywords contextual banditsensemble samplingregret boundsgeneralized linear banditsneural contextual banditsrandomized explorationanytime algorithms

0 comments

The pith

Ensemble sampling with perturbed maximum likelihood estimators achieves high-probability regret bounds for nonlinear contextual bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits. It maintains multiple estimators for the reward model parameters by performing maximum likelihood estimation on randomly perturbed data. This setup supports proving high-probability frequentist regret bounds of order d to the 3/2 times square root of T plus d to the 4 for generalized linear models and order tilde d to the 3/2 times square root of T for neural models. The framework also includes anytime versions of the algorithms that do not require knowing the total number of rounds in advance. A sympathetic reader would care because the results make randomized exploration methods both provably efficient and practical for reward models that are nonlinear in the features.

Core claim

By maintaining an ensemble of estimators obtained via maximum likelihood estimation on randomly perturbed data, GLM-ES and Neural-ES achieve high-probability frequentist regret bounds of tilde O of d to the 3/2 square root of T plus d to the 4 for GLM-ES and tilde O of tilde d to the 3/2 square root of T for Neural-ES, while their anytime variants remove the fixed time horizon assumption.

What carries the argument

The ensemble of estimators constructed via maximum likelihood estimation on randomly perturbed data, which supplies the diversity for exploration and supports concentration bounds tailored to nonlinear reward models.

If this is right

The GLM-ES regret bound matches the state-of-the-art result for randomized exploration algorithms in the generalized linear bandit setting.
The Neural-ES bound depends on the effective dimension of the neural tangent kernel matrix instead of the raw parameter count.
Anytime variants of both algorithms function effectively when the total number of rounds T is unknown in advance.
Empirical evaluations show that GLM-ES, Neural-ES, and their anytime versions achieve strong practical performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The perturbation technique for building diverse estimators might extend to other nonlinear bandit formulations beyond generalized linear and neural models.
When the effective dimension stays small, Neural-ES could scale exploration to feature spaces that would otherwise be intractable.
The handling of nonlinear concentration challenges may inform analyses of related randomized methods such as variants of Thompson sampling.

Load-bearing premise

Maximum likelihood estimation on randomly perturbed data produces sufficiently diverse and accurate estimators for the nonlinear reward models to enable the required concentration and regret properties.

What would settle it

A controlled experiment or simulation in which the observed cumulative regret for GLM-ES or Neural-ES exceeds the stated bounds by more than a constant factor across repeated independent runs with known parameters.

read the original abstract

We provide a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits and develop corresponding regret bounds for two most common nonlinear contextual bandit settings: Generalized Linear Ensemble Sampling (GLM-ES) for generalized linear bandits and Neural Ensemble Sampling (Neural-ES) for neural contextual bandits. Both methods maintain multiple estimators for the reward model parameters via maximum likelihood estimation on randomly perturbed data. We prove high-probability frequentist regret bounds of $\widetilde{O}(d^{3/2} \sqrt{T} + d^{4})$ for GLM-ES and $\widetilde{{O}}(\widetilde{d}^{3/2} \sqrt{T})$ for Neural-ES, where $d$ is the dimension of feature vectors, $\widetilde{d}$ is the effective dimension of a neural tangent kernel (NTK) matrix and $T$ is the number of rounds. The regret bound of GLM-ES matches the state-of-the-art result of randomized exploration algorithms in generalized linear bandit setting. In the theoretical analysis, we introduce techniques that address challenges specific to nonlinear models. Practically, we remove fixed-time horizon assumption by developing anytime versions of our algorithms, suitable when $T$ is unknown. Finally, we empirically evaluate GLM-ES, Neural-ES and their anytime variants, demonstrating strong performance. Overall, our results establish ensemble sampling as a provable and practical randomized exploration approach for nonlinear contextual bandits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GLM-ES and Neural-ES as ensemble sampling algorithms for nonlinear contextual bandits. Both maintain multiple reward model estimators via MLE on randomly perturbed data. The central claims are high-probability frequentist regret bounds of ~O(d^{3/2} sqrt(T) + d^4) for GLM-ES (matching SOTA for randomized exploration in GLBs) and ~O(~d^{3/2} sqrt(T)) for Neural-ES (using NTK effective dimension), plus anytime variants that remove the fixed-horizon assumption and supporting empirical results.

Significance. If the bounds and concentration arguments hold, the work provides a unified, provable randomized exploration framework for nonlinear bandits, with the GLM result matching known SOTA and the Neural result giving the first such bound in terms of effective dimension. The nonlinearity-specific techniques and practical anytime algorithms are clear strengths; the empirical evaluation adds evidence of practicality.

major comments (2)

[theoretical analysis for GLM-ES] Theoretical analysis (GLM-ES concentration): the high-probability bounds for perturbed MLE estimators require the random perturbation to be scaled relative to the minimum eigenvalue of the information matrix to guarantee the 1/sqrt(t) rate and sufficient diversity. The current presentation does not make this scaling explicit, which is load-bearing for the instantaneous regret decomposition and the origin of the additive d^4 term.
[proof of regret bounds] Regret bound derivation: the extra d^4 term in the GLM-ES bound ~O(d^{3/2} sqrt(T) + d^4) is attributed to union bound or perturbation analysis; a concrete step showing whether this term is removable by tighter control on the perturbation variance (relative to sample size and Gram matrix conditioning) would strengthen the result.

minor comments (2)

[notation and preliminaries] The definition of effective dimension ~d for the NTK matrix should be stated explicitly in the main text with a pointer to the precise matrix whose eigenvalues are used.
[experiments] In the experimental section, parameter choices for the perturbation variance and ensemble size should be reported with sensitivity checks to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We have carefully reviewed the points raised regarding the GLM-ES concentration analysis and the origin of the additive term in the regret bound. Below we respond point by point, indicating the revisions we will incorporate.

read point-by-point responses

Referee: Theoretical analysis (GLM-ES concentration): the high-probability bounds for perturbed MLE estimators require the random perturbation to be scaled relative to the minimum eigenvalue of the information matrix to guarantee the 1/sqrt(t) rate and sufficient diversity. The current presentation does not make this scaling explicit, which is load-bearing for the instantaneous regret decomposition and the origin of the additive d^4 term.

Authors: We agree that an explicit statement of the perturbation scaling is necessary for clarity. In the proof (Appendix B), the perturbation variance is set proportionally to the inverse square root of the minimum eigenvalue of the Gram matrix at each round to ensure both the 1/sqrt(t) convergence rate for each estimator and sufficient diversity across the ensemble. We will revise the main text (Section 3.1 and the statement of Theorem 1) to state this scaling choice explicitly and to briefly indicate how it enters the instantaneous regret decomposition. This change will also make the provenance of the additive d^4 term more transparent. revision: yes
Referee: Regret bound derivation: the extra d^4 term in the GLM-ES bound ~O(d^{3/2} sqrt(T) + d^4) is attributed to union bound or perturbation analysis; a concrete step showing whether this term is removable by tighter control on the perturbation variance (relative to sample size and Gram matrix conditioning) would strengthen the result.

Authors: The d^4 term arises from a union bound over T rounds together with a crude bound on the perturbation-induced deviation that holds uniformly over all possible Gram-matrix condition numbers. While a more refined variance schedule that adapts to the instantaneous minimum eigenvalue could in principle reduce the term, any such schedule must still guarantee a minimum level of diversity to obtain the d^{3/2} sqrt(T) exploration cost; overly aggressive shrinkage of the perturbation variance risks collapsing the ensemble and inflating the leading term. A concrete route to removal would require replacing the union bound with a martingale concentration inequality that directly controls the cumulative perturbation effect, but adapting such an inequality to the nonlinear MLE setting while preserving the high-probability guarantee is technically involved and lies beyond the scope of the present work. In the revision we will add a short remark after Theorem 1 discussing the origin of the additive term and noting that its removal is an open technical question. revision: partial

Circularity Check

0 steps flagged

No significant circularity; bounds derived from standard concentration analysis

full rationale

The derivation proceeds from concentration properties of MLE on randomly perturbed data combined with standard high-probability regret decomposition tools for contextual bandits. The final bounds are not obtained by fitting a parameter to the target quantity and renaming it a prediction, nor do they reduce to a self-citation chain or an ansatz imported from the authors' prior work. The effective dimension ~d is a conventional NTK quantity and does not create a definitional loop within the paper's own equations. The analysis is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard concentration inequalities for MLE, properties of the neural tangent kernel, and the assumption that random perturbations produce useful diversity in the ensemble. No new free parameters or invented entities are introduced in the abstract; the analysis uses existing tools from bandit theory and kernel methods.

axioms (2)

domain assumption Maximum likelihood estimators on randomly perturbed data yield sufficiently concentrated estimates for the nonlinear reward function.
Invoked to justify the ensemble diversity and the subsequent high-probability regret analysis.
standard math Standard martingale concentration and self-normalized bounds extend to the GLM and NTK settings with the stated perturbation.
Used to obtain the high-probability frequentist regret bounds.

pith-pipeline@v0.9.0 · 5773 in / 1569 out tokens · 29191 ms · 2026-05-18T07:08:07.997181+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We maintain an ensemble of perturbed models, each with different perturbed history Dj_t ... θ^j_t ← argmin_θ L_nonlin(θ; D_t)
IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

high-probability frequentist regret bounds of ~O(d^{3/2} sqrt(T) + d^4) for GLM-ES

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.