pith. sign in

arxiv: 2506.01250 · v2 · submitted 2025-06-02 · 💻 cs.LG · stat.ML

Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration

Pith reviewed 2026-05-19 11:26 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords contextual dueling banditsneural networksvariance-aware explorationregret boundsUCBThompson samplingnonlinear utilities
0
0 comments X

The pith

Variance-aware neural algorithms for contextual dueling bandits achieve sublinear regret of order O(d sqrt(sum sigma_t^2) + sqrt(dT)).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops algorithms for the contextual dueling bandit problem that approximate nonlinear utility functions with neural networks. It introduces a variance-aware exploration strategy that accounts for uncertainty in pairwise comparisons using only gradients from the last layer, and applies this under both UCB and Thompson Sampling. A sympathetic reader would care because the resulting regret scales with the actual variance of comparisons rather than a worst-case bound, which can improve efficiency in preference-based tasks such as recommendations or rankings.

Core claim

Under standard assumptions, the proposed algorithms achieve sublinear cumulative average regret of order O(d sqrt(sum_{t=1}^T sigma_t^2) + sqrt(dT)) for sufficiently wide neural networks, where d is the contextual dimension, sigma_t^2 is the variance of comparisons at round t, and T is the total number of rounds. The design balances exploration and exploitation while relying solely on last-layer gradients for the variance-aware component.

What carries the argument

Variance-aware exploration strategy that adaptively accounts for uncertainty in pairwise comparisons while relying only on gradients with respect to the learnable parameters of the last layer.

If this is right

  • The algorithms provide theoretical guarantees under both Upper Confidence Bound and Thompson Sampling frameworks.
  • They apply to nonlinear utility functions via wide neural networks.
  • Empirical validation shows sublinear regret on synthetic nonlinear tasks and real-world tasks while maintaining reasonable computational efficiency.
  • The methods outperform existing approaches in the reported experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach may extend naturally to recommendation systems where user preferences exhibit complex nonlinear structure.
  • It suggests that deep representations paired with shallow last-layer exploration can reduce computation in high-dimensional preference learning.
  • The variance-dependent term in the regret bound could guide practical tuning when comparison noise varies across rounds.

Load-bearing premise

The neural networks must be sufficiently wide to approximate the unknown nonlinear utility functions and the variance-aware exploration must be effective when computed solely from last-layer gradients.

What would settle it

An experiment showing linear regret growth when using sufficiently wide neural networks on nonlinear utilities or when last-layer gradient variance estimates fail to adapt exploration properly.

Figures

Figures reproduced from arXiv: 2506.01250 by Jaemin Park, Jinje Park, Taejin Paik, Youngmin Oh.

Figure 1
Figure 1. Figure 1: Comparison of cumulative average regret under (a)-(c): synthetic tasks with a context dimension of d = 5 and a total of K = 5 arms. Each experiment is repeated across 20 random seeds. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
read the original abstract

In this paper, we address the contextual dueling bandit problem by proposing variance-aware algorithms that leverage neural networks to approximate nonlinear utility functions. Our approach employs a \textit{variance-aware exploration strategy}, which adaptively accounts for uncertainty in pairwise comparisons while relying only on the gradients with respect to the learnable parameters of the last layer. This design effectively balances the exploration--exploitation tradeoff under both the Upper Confidence Bound (UCB) and Thompson Sampling (TS) frameworks. As a result, under standard assumptions, we establish theoretical guarantees showing that our algorithms achieve sublinear cumulative average regret of order $\bigol\lt(d \sqrt{\sum_{t=1}^T \sigma_t^2} + \sqrt{dT}\rt),$ for sufficiently wide neural networks, where $ d $ is the contextual dimension, $ \sigma_t^2 $ the variance of comparisons at round $ t $, and $ T $ the total number of rounds. We also empirically validate that our approach offers reasonable computational efficiency and achieves sublinear regret on both synthetic tasks with nonlinear utilities and real-world tasks, outperforming existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes variance-aware UCB and Thompson Sampling algorithms for contextual dueling bandits. These use neural networks to model nonlinear utility functions but compute the exploration bonus exclusively from last-layer gradients. Under standard assumptions, the algorithms are claimed to achieve sublinear cumulative average regret of order O(d sqrt(sum_{t=1}^T sigma_t^2) + sqrt(d T)) when the networks are sufficiently wide. Empirical results on synthetic nonlinear tasks and real-world datasets are reported to show sublinear regret and outperformance relative to prior methods.

Significance. If the central regret bound holds, the work would be significant for demonstrating that deep representations can be paired with computationally cheap last-layer-only variance estimation to obtain adaptive, variance-aware regret bounds in dueling settings. This design choice could improve scalability over methods that propagate uncertainty through all layers. The explicit dependence on the per-round comparison variance sigma_t^2 is a positive feature of the stated bound.

major comments (1)
  1. [Abstract] Abstract and the regret statement: the bound O(d sqrt(sum sigma_t^2) + sqrt(d T)) is derived under the assumption that last-layer gradients alone suffice to set the variance-aware widths. No explicit width-dependent or depth-dependent error term is supplied to absorb residual uncertainty from earlier layers, which would be required for the concentration inequalities underlying the UCB/TS analysis to remain valid. This is load-bearing for the central theoretical claim.
minor comments (3)
  1. [Theoretical Analysis] Clarify the precise definition of 'sufficiently wide' networks and state the minimal width scaling required for the approximation and concentration arguments to go through.
  2. [Experiments] In the experimental section, report the network depths and widths used, together with the precise form of the last-layer gradient computation for the bonus.
  3. [Abstract] The notation in the regret expression uses non-standard delimiters; replace with standard big-O notation for readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful review and for identifying this important point regarding the theoretical justification. We address the major comment below and will incorporate clarifications in the revision.

read point-by-point responses
  1. Referee: [Abstract] Abstract and the regret statement: the bound O(d sqrt(sum sigma_t^2) + sqrt(d T)) is derived under the assumption that last-layer gradients alone suffice to set the variance-aware widths. No explicit width-dependent or depth-dependent error term is supplied to absorb residual uncertainty from earlier layers, which would be required for the concentration inequalities underlying the UCB/TS analysis to remain valid. This is load-bearing for the central theoretical claim.

    Authors: We agree that an explicit discussion of the approximation error from earlier layers would strengthen the rigor of the presentation. Our analysis is conducted in the wide-network regime where, under standard NTK assumptions, the hidden-layer representations become effectively fixed after random initialization, and the uncertainty in the utility estimates is captured by the last-layer parameters. The width-dependent terms that control the quality of this approximation appear in the concentration arguments but were not highlighted in the abstract or main regret statement. We will revise the abstract, the statement of the main theorem, and the proof sketch to include an explicit remark on the width-dependent error term that absorbs residual uncertainty from earlier layers, ensuring the concentration inequalities remain valid. This does not change the leading-order regret bound but makes the assumptions more transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: regret bound derived from standard assumptions without reduction to inputs

full rationale

The paper claims a sublinear regret bound O(d sqrt(sum sigma_t^2) + sqrt(d T)) under standard assumptions for sufficiently wide neural networks, using variance-aware exploration based on last-layer gradients. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations are present in the abstract or described derivation. The last-layer focus is an explicit design choice justified by computational efficiency and wide-network approximation, not a circular redefinition of the bound itself. The result is presented as following from independent theoretical analysis rather than by construction from the algorithm's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard contextual dueling bandit assumptions and the approximation power of sufficiently wide neural networks; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Standard assumptions for contextual dueling bandits hold
    Explicitly invoked for the regret bound in the abstract
  • domain assumption Neural networks are sufficiently wide to approximate nonlinear utilities
    Required for the stated theoretical guarantees

pith-pipeline@v0.9.0 · 5733 in / 1251 out tokens · 72630 ms · 2026-05-19T11:26:03.213117+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

  1. [1]

    Improved algorithms for linear stochastic bandits

    Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011

  2. [2]

    Ee-net: Exploitation-exploration neural networks in contextual bandits

    Yikun Ban, Yuchen Yan, Arindam Banerjee, and Jingrui He. Ee-net: Exploitation-exploration neural networks in contextual bandits. arXiv preprint arXiv:2110.03177, 2021

  3. [3]

    Stochastic contextual dueling bandits under linear stochastic transitivity models

    Viktor Bengs, Aadirupa Saha, and Eyke Hüllermeier. Stochastic contextual dueling bandits under linear stochastic transitivity models. In International Conference on Machine Learning, pages 1764–1786. PMLR, 2022

  4. [4]

    Matrix analysis, volume 169

    Rajendra Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 2013

  5. [5]

    Variance-aware linear ucb with deep representa- tion for neural contextual bandits

    Ha Manh Bui, Enrique Mallada, and Anqi Liu. Variance-aware linear ucb with deep representa- tion for neural contextual bandits. arXiv preprint arXiv:2411.05979, 2024

  6. [6]

    Generalization bounds of stochastic gradient descent for wide and deep neural networks

    Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in neural information processing systems, 32, 2019

  7. [7]

    Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs.The Annals of Statistics, 27(4):1155–1163, 1999

    Kani Chen, Inchi Hu, and Zhiliang Ying. Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs.The Annals of Statistics, 27(4):1155–1163, 1999

  8. [8]

    Federated neural bandits

    Zhongxiang Dai, Yao Shu, Arun , Flint Xiaofeng Fan, Bryan Kian Hsiang Low, and Patrick Jaillet. Federated neural bandits. arXiv preprint arXiv:2205.14309, 2022

  9. [9]

    Contextual bandits with online neural regression

    Rohan Deb, Yikun Ban, Shiliang Zuo, Jingrui He, and Arindam Banerjee. Contextual bandits with online neural regression. arXiv preprint arXiv:2312.07145, 2023

  10. [10]

    Variance-aware regret bounds for stochastic contextual dueling bandits

    Qiwei Di, Tao Jin, Yue Wu, Heyang Zhao, Farzad Farnoud, and Quanquan Gu. Variance-aware regret bounds for stochastic contextual dueling bandits. In The Twelfth International Conference on Learning Representations, 2023

  11. [11]

    Contextual dueling bandits

    Miroslav Dudík, Katja Hofmann, Robert E Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Conference on Learning Theory, pages 563–587. PMLR, 2015

  12. [12]

    Elliptic partial differential equations of second order, volume 224

    David Gilbarg, Neil S Trudinger, David Gilbarg, and NS Trudinger. Elliptic partial differential equations of second order, volume 224. Springer, 1977

  13. [13]

    Horn and Charles R

    Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012

  14. [14]

    Mm algorithms for generalized bradley-terry models

    David R Hunter. Mm algorithms for generalized bradley-terry models. The annals of statistics, 32(1):384–406, 2004

  15. [15]

    Combinatorial neural bandits

    Taehyun Hwang, Kyuwook Chai, and Min-hwan Oh. Combinatorial neural bandits. In Interna- tional Conference on Machine Learning, pages 14203–14236. PMLR, 2023

  16. [16]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems , 31, 2018

  17. [17]

    Regret Bounds for Non-decomposable Metrics with Missing Labels

    Prateek Jain and Nagarajan Natarajan. Regret bounds for non-decomposable metrics with missing labels. arXiv preprint arXiv:1606.02077, 2016

  18. [18]

    Improved regret analysis for variance- adaptive linear bandits and horizon-free linear mixture mdps

    Yeoneung Kim, Insoon Yang, and Kwang-Sung Jun. Improved regret analysis for variance- adaptive linear bandits and horizon-free linear mixture mdps. Advances in Neural Information Processing Systems, 35:1060–1072, 2022

  19. [19]

    Randomized exploration in generalized linear bandits

    Branislav Kveton, Manzil Zaheer, Csaba Szepesvari, Lihong Li, Mohammad Ghavamzadeh, and Craig Boutilier. Randomized exploration in generalized linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 2066–2076. PMLR, 2020

  20. [20]

    Bandit algorithms

    Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020

  21. [21]

    Provably optimal algorithms for generalized linear contextual bandits

    Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning , pages 2071–2080. PMLR, 2017

  22. [22]

    Feel-good thompson sam- pling for contextual dueling bandits.arXiv preprint arXiv:2404.06013,

    Xuheng Li, Heyang Zhao, and Quanquan Gu. Feel-good thompson sampling for contextual dueling bandits. arXiv preprint arXiv:2404.06013, 2024. 10

  23. [23]

    Individual choice behavior: A theoretical analysis

    R Duncan Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2005

  24. [24]

    Principles of mathematical analysis, volume 3

    Walter Rudin et al. Principles of mathematical analysis, volume 3. McGraw-hill New York, 1964

  25. [25]

    Optimal algorithms for stochastic contextual preference bandits

    Aadirupa Saha. Optimal algorithms for stochastic contextual preference bandits. Advances in Neural Information Processing Systems, 34:30050–30062, 2021

  26. [26]

    Adversarial dueling bandits

    Aadirupa Saha, Tomer Koren, and Yishay Mansour. Adversarial dueling bandits. InInternational Conference on Machine Learning, pages 9235–9244. PMLR, 2021

  27. [27]

    A law of comparative judgment

    Louis L Thurstone. A law of comparative judgment. In Scaling, pages 81–92. Routledge, 2017

  28. [28]

    Old dog learns new tricks: Randomized ucb for bandit problems

    Sharan Vaswani, Abbas Mehrabian, Audrey Durand, and Branislav Kveton. Old dog learns new tricks: Randomized ucb for bandit problems. arXiv preprint arXiv:1910.04928, 2019

  29. [29]

    Neural dueling bandits, 2024

    Arun Verma, Zhongxiang Dai, Xiaoqiang Lin, Patrick Jaillet, and Bryan Kian Hsiang Low. Neural dueling bandits, 2024. URL https://arxiv.org/abs/2407.17112

  30. [30]

    Neural contextual bandits with deep representation and shallow exploration, 2020

    Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration, 2020. URL https://arxiv.org/abs/2012.01780

  31. [31]

    The k-armed dueling bandits problem

    Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012

  32. [32]

    Mathematical analysis of machine learning algorithms

    Tong Zhang. Mathematical analysis of machine learning algorithms. Cambridge University Press, 2023

  33. [33]

    Neural thompson sampling,

    Weitong Zhang, Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural thompson sampling,

  34. [34]

    URL https://arxiv.org/abs/2010.00827

  35. [35]

    Improved variance-aware confidence sets for linear bandits and linear mixture mdp

    Zihan Zhang, Jiaqi Yang, Xiangyang Ji, and Simon S Du. Improved variance-aware confidence sets for linear bandits and linear mixture mdp. Advances in Neural Information Processing Systems, 34:4342–4355, 2021

  36. [36]

    Horizon-free reinforcement learning in polynomial time: the power of stationary policies

    Zihan Zhang, Xiangyang Ji, and Simon Du. Horizon-free reinforcement learning in polynomial time: the power of stationary policies. In Conference on Learning Theory, pages 3858–3904. PMLR, 2022

  37. [37]

    Variance-dependent regret bounds for linear bandits and reinforcement learning: Adaptivity and computational efficiency

    Heyang Zhao, Jiafan He, Dongruo Zhou, Tong Zhang, and Quanquan Gu. Variance-dependent regret bounds for linear bandits and reinforcement learning: Adaptivity and computational efficiency. In The Thirty Sixth Annual Conference on Learning Theory , pages 4977–5020. PMLR, 2023

  38. [38]

    Computationally efficient horizon-free reinforcement learning for linear mixture mdps

    Dongruo Zhou and Quanquan Gu. Computationally efficient horizon-free reinforcement learning for linear mixture mdps. Advances in neural information processing systems, 35:36337–36349, 2022

  39. [39]

    Neural contextual bandits with ucb-based exploration

    Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pages 11492–11502. PMLR, 2020

  40. [40]

    Nearly minimax optimal reinforcement learning for linear mixture markov decision processes

    Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pages 4532–4576. PMLR, 2021

  41. [41]

    Provably efficient reinforcement learning for discounted mdps with feature mapping

    Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted mdps with feature mapping. In International Conference on Machine Learning, pages 12793–12802. PMLR, 2021. 11 Supplementary Material: Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration Organization This document serves as ...

  42. [42]

    ∥∆ϕt,kt,1,kt,2 ∥V −1 t−1 + (M′ 1(t) + N ′

  43. [43]

    Therefore, choosing at = M′ 1(t) + N ′ 1, we can obtain 2ra t = eO M2 + M3 + (M′ 1(t) + N ′

    ∥∆ϕt,k∗ t ,kt,1 ∥V −1 t−1 , (C.23) where the last two terms result from Lemma B.1. Therefore, choosing at = M′ 1(t) + N ′ 1, we can obtain 2ra t = eO M2 + M3 + (M′ 1(t) + N ′

  44. [44]

    (C.24) 29 where N ′ 1, M′ 1, M2, M3 are defined in (B.29), (B.30), (C.20), (C.21)

    ∥∆ϕt,kt,1,kt,2 ∥V −1 t−1 . (C.24) 29 where N ′ 1, M′ 1, M2, M3 are defined in (B.29), (B.30), (C.20), (C.21). In a similar manner to the asymmetric arm selection, we can obatain equivalent inequalities under the two symmetric arm selections. In other words, there exists C > 0 such that 2ra t = eO M2 + M3 + (M′ 1(t) + N ′

  45. [45]

    ∥∆ϕt,kt,1,kt,2 ∥V −1 t−1 , (C.25) independently of the arm selection strategies, if at = M′ 1(t) + N ′

  46. [46]

    Note that N ′ 1 = eO( √ d). Therefore, if we set ζt = 1, which is the variance-agnostic case, by summing up to T , and applying Lemma A.7, we can show that 2 TX t=1 ra t = eO d √ T + d3 T M−1/6 + T M−1/6 2 /( √ T ) ∥u −eu∥4/3 H−1 = eO d √ T + d3 T M−1/6 ∥u −eu∥4/3 H−1 . (C.26) The last term results from Eq. (B.25). Thus, we can prove Corollary 1. ■ Proof ...

  47. [47]

    ts " : 19utilities_j1 = utilities_j1 . to (

    reshape ( -1 , 1) 10co nt ex t_ ar ms_ li st . append ( context_arms ) Listing 1: Pseudocode for constructing Θ and context sets Xt. E.1.2 Real-World Dataset Moreover, we consider real-world datasets provided by the UCI machine learning repository 4 — Statlog, Magic, and Covertype—each consisting of feature vectors associated with one of multiple possible...