Neural Exploitation and Exploration of Contextual Bandits

Arindam Banerjee; Jingrui He; Yikun Ban; Yuchen Yan

arxiv: 2305.03784 · v3 · submitted 2023-05-05 · 💻 cs.LG

Neural Exploitation and Exploration of Contextual Bandits

Yikun Ban , Yuchen Yan , Arindam Banerjee , Jingrui He This is my paper

Pith reviewed 2026-05-24 08:19 UTC · model grok-4.3

classification 💻 cs.LG

keywords contextual banditsneural networksexploitation-explorationregret boundsEE-Netmulti-armed banditsadaptive exploration

0 comments

The pith

EE-Net uses separate neural networks for reward learning and potential-gain estimation to obtain an instance-based square-root regret bound in contextual bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes EE-Net as a neural approach to the exploitation-exploration tradeoff in contextual multi-armed bandits. One network models the reward function while a second network adaptively estimates the potential gain from each action relative to the current model. This replaces explicit large-deviation statistical bounds with learned exploration signals. The authors derive an instance-dependent Õ(√T) regret upper bound and report that the method outperforms linear and neural baselines on real-world datasets.

Core claim

EE-Net consists of an exploitation network that learns the reward function and an exploration network that learns potential gains compared to the current estimate. The joint architecture produces an instance-based Õ(√T) regret upper bound without relying on Thompson Sampling or UCB-style concentration inequalities, and it yields lower empirical regret than prior neural contextual bandit methods on real datasets.

What carries the argument

The EE-Net dual-network architecture, where the exploration network outputs an estimate of potential gain to drive adaptive exploration.

If this is right

The regret scales as Õ(√T) in an instance-dependent way rather than worst-case.
The approach applies directly to non-linear reward functions through the use of neural networks.
Empirical performance improves over both linear contextual bandits and prior neural methods that combine networks with TS or UCB.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation into two networks could be tested in other sequential decision settings where uncertainty estimation is expensive.
If joint training succeeds without extra regularization, similar dual-head designs might simplify exploration in deep reinforcement learning.
The instance-dependent bound suggests checking whether the method retains its advantage on problems with very sparse rewards.

Load-bearing premise

The regret bound holds only if the exploration network can be trained to produce useful gain estimates without introducing bias that invalidates the analysis.

What would settle it

An instance where the observed regret of EE-Net grows faster than Õ(√T) or where the learned exploration signals cause the algorithm to select actions worse than a simple epsilon-greedy baseline.

Figures

Figures reproduced from arXiv: 2305.03784 by Arindam Banerjee, Jingrui He, Yikun Ban, Yuchen Yan.

**Figure 2.** Figure 2: With the same exploitation network f1, EE-Net outperforms all baselines. effective exploration. To make exploration, NeuralUCB statistically calculates a gradientbased upper confidence bound and NeuralTS draws each arm’s predicted reward from a normal distribution where the standard deviation is computed by gradient. However, the confidence bound or standard deviation they calculate only considers the wor… view at source ↗

**Figure 3.** Figure 3: Ablation study on label function y for f2. EE-Net denotes y1 = r − f1, EE-Net-abs denotes y2 = |r − f1|, and EE-Net-ReLU denotes y3 = ReLU(r − f1). EE-Net shows the best performance on these two datasets. the difference of negative gain. Therefore, y1 usually is the most effective one for empirical performance. 7. Conclusion In this paper, we propose a novel exploration strategy, EE-Net, by investigating t… view at source ↗

**Figure 4.** Figure 4: Two types of exploration: Upward exploration and Downward exploration. [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

read the original abstract

In this paper, we study utilizing neural networks for the exploitation and exploration of contextual multi-armed bandits. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration trade-off in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, a series of neural bandit algorithms have been proposed to adapt to the non-linear reward function, combined with TS or UCB strategies for exploration. In this paper, instead of calculating a large-deviation based statistical bound for exploration like previous methods, we propose, ``EE-Net,'' a novel neural-based exploitation and exploration strategy. In addition to using a neural network (Exploitation network) to learn the reward function, EE-Net uses another neural network (Exploration network) to adaptively learn the potential gains compared to the currently estimated reward for exploration. We provide an instance-based $\widetilde{\mathcal{O}}(\sqrt{T})$ regret upper bound for EE-Net and show that EE-Net outperforms related linear and neural contextual bandit baselines on real-world datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EE-Net's separate exploration network is a real shift from standard neural bandit setups, but the claimed instance-based sqrt(T) regret lacks the function-class and joint-training details needed to check it.

read the letter

The main point is that EE-Net trains one network for reward prediction and a second one to estimate potential gains for exploration, instead of layering TS or UCB on top of a single reward model. That architecture choice is new relative to the neural-bandit papers cited in the abstract, and the authors report that it beats both linear and neural baselines on real datasets while claiming an instance-dependent O(sqrt(T)) regret bound.

Referee Report

3 major / 1 minor

Summary. The paper proposes EE-Net, a contextual bandit algorithm that employs two neural networks: an exploitation network to model the reward function and an exploration network to adaptively estimate potential gains relative to the current reward estimate. It claims an instance-based Õ(√T) regret upper bound and reports empirical superiority over linear and neural contextual bandit baselines on real-world datasets.

Significance. If the regret bound can be established under explicit assumptions, the approach would introduce a learned, instance-adaptive exploration mechanism that avoids large-deviation statistical bounds, potentially improving performance in non-linear reward settings. The empirical comparisons suggest practical relevance, but the theoretical contribution cannot be assessed without the missing analysis.

major comments (3)

[Abstract] Abstract: The manuscript states an instance-based Õ(√T) regret upper bound for EE-Net but provides no derivation, no function-class assumptions on the neural networks (e.g., bounded-norm ReLU networks of specified width/depth), and no description of the joint training procedure or regularization that would allow control of approximation and optimization errors inside the regret analysis.
[Theoretical Analysis] Theoretical contribution: The central claim requires that the exploration network produces a potential-gain estimate whose uniform error can be bounded without introducing bias terms that degrade the Õ(√T) rate; no such argument, covering-number or Rademacher-complexity control, or training-error bound is supplied, rendering the regret statement unverifiable from the given text.
[Method] Method description: The joint optimization of the exploitation and exploration networks is described at a high level only; without an explicit loss, alternation schedule, or generalization bound that is folded into the regret analysis, it is impossible to confirm that the claimed rate holds.

minor comments (1)

The abstract refers to 'real-world datasets' without naming them or summarizing the experimental protocol (number of runs, hyper-parameter selection, etc.).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the theoretical claims and method clarity. We address each point below and will incorporate revisions to strengthen the presentation of the regret analysis.

read point-by-point responses

Referee: [Abstract] Abstract: The manuscript states an instance-based Õ(√T) regret upper bound for EE-Net but provides no derivation, no function-class assumptions on the neural networks (e.g., bounded-norm ReLU networks of specified width/depth), and no description of the joint training procedure or regularization that would allow control of approximation and optimization errors inside the regret analysis.

Authors: We agree the abstract is too concise. The full manuscript states the bound in Theorem 1 (Section 4) under the assumption of bounded-norm ReLU networks, but the abstract does not preview the function-class or error-control details. In revision we will expand the abstract to note the network assumptions and that approximation/optimization errors are controlled via the joint training procedure described in Section 3. revision: yes
Referee: [Theoretical Analysis] Theoretical contribution: The central claim requires that the exploration network produces a potential-gain estimate whose uniform error can be bounded without introducing bias terms that degrade the Õ(√T) rate; no such argument, covering-number or Rademacher-complexity control, or training-error bound is supplied, rendering the regret statement unverifiable from the given text.

Authors: The instance-dependent analysis in Section 4 aims to bound the exploration-network error directly via the realized reward gaps rather than uniform large-deviation bounds. However, the current text supplies only a high-level argument and omits explicit covering-number or Rademacher controls on the joint training error. We will add these controls (or a proof sketch) to the revised Section 4 and appendix so the Õ(√T) rate is fully verifiable. revision: yes
Referee: [Method] Method description: The joint optimization of the exploitation and exploration networks is described at a high level only; without an explicit loss, alternation schedule, or generalization bound that is folded into the regret analysis, it is impossible to confirm that the claimed rate holds.

Authors: Algorithm 1 and Section 3.2 outline the two-network training, but the explicit losses, alternation frequency, and how the resulting generalization error enters the regret proof are only sketched. We will add the precise loss functions, the alternation schedule, and the explicit generalization term inside the regret decomposition in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is a new construction with independent content

full rationale

The paper presents EE-Net as a novel two-network architecture (exploitation network for reward function, exploration network for potential gains) and states an instance-based Õ(√T) regret upper bound without any equations or steps that reduce the bound, the exploration signal, or the claimed performance to a fitted quantity defined by the same data or by self-citation. The abstract and description frame the approach as a distinct strategy from prior TS/UCB neural methods rather than a re-derivation, and no load-bearing premise collapses to an input by construction. The central claim therefore retains independent content from the proposed architecture.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The abstract supplies almost no explicit assumptions or parameters; the regret bound and the training procedure must rest on standard supervised-learning assumptions for neural nets plus whatever conditions are needed for the exploration network to produce a valid bonus. No invented entities are introduced beyond the two networks themselves.

free parameters (1)

network architectures and learning rates
Standard neural-net hyper-parameters that must be chosen for both the exploitation and exploration networks; their values are not reported in the abstract.

axioms (1)

domain assumption The reward function belongs to a class for which a neural network can approximate it sufficiently well for the regret analysis to go through.
Implicit in any neural contextual bandit claim; required for both the bound and the empirical claim.

pith-pipeline@v0.9.0 · 5724 in / 1499 out tokens · 28233 ms · 2026-05-24T08:19:06.262591+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel, Jcost) J_uniquely_calibrated_via_higher_derivative; reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose EE-Net … another neural network (Exploration network) to adaptively learn the potential gains … input … gradient of f1 … ground truth … residual … instance-based Õ(√T) regret … Ψ(θ20,R/m1/4) = inf … ∑(f2(xt;θ)−rt)²
IndisputableMonolith/Foundation/DimensionForcing.lean; AlexanderDuality.lean alexander_duality_circle_linking (D=3) unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

regret bound … independent of d or ˜d … improves by √logT … over NeuralUCB/TS

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

[1]

Convolutional neural bandit: Provable algorithm for visual-aware advertising

Yikun Ban and Jingrui He. Convolutional neural bandit: Provable algorithm for visual-aware advertising. arXiv preprint arXiv:2107.07438, 2021a. Yikun Ban and Jingrui He. Local clustering in contextual multi-armed bandits. InProceedings of the Web Conference 2021, pages 2335–2346, 2021b. Yikun Ban, Jingrui He, and Curtiss B Cook. Multi-facet contextual ban...

work page arXiv 2021
[2]

Provable regret bounds for deep online learning and control.arXiv preprint arXiv:2110.07807,

Xinyi Chen, Edgar Minasyan, Jason D Lee, and Elad Hazan. Provable regret bounds for deep online learning and control.arXiv preprint arXiv:2110.07807,

work page arXiv
[3]

SDG: A simpliﬁed and dynamic graph neural network

Dongqi Fu and Jingrui He. SDG: A simpliﬁed and dynamic graph neural network. InSIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 2273–2277. ACM,

work page 2021
[4]

Convolutional networks for images, speech, and time series

Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995,

work page 1995
[5]

Provably optimal algorithms for generalized linear contextual bandits

Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. InInternational Conference on Machine Learning, pages 2071–2080. PMLR,

work page 2071
[6]

Ensemble sampling.arXiv preprint arXiv:1705.07347,

15 Xiuyuan Lu and Benjamin Van Roy. Ensemble sampling.arXiv preprint arXiv:1705.07347,

work page arXiv
[7]

Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling

Carlos Riquelme, George Tucker, and Jasper Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling.arXiv preprint arXiv:1802.09127,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Finite-Time Analysis of Kernelised Contextual Bandits

Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits.arXiv preprint arXiv:1309.6869,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Neural contextual bandits with deep representation and shallow exploration.arXiv preprint arXiv:2012.01780,

Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration.arXiv preprint arXiv:2012.01780,

work page arXiv 2012
[10]

Further Discussion A.1 Independence of Input or Eﬀective Dimension This property comes from the new analysis methodology of EE-Net

16 A. Further Discussion A.1 Independence of Input or Eﬀective Dimension This property comes from the new analysis methodology of EE-Net. As the linear bandits (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019;) work in the Euclidean space and build the conﬁdence ellipsoid forθ∗ (optimal parameters) based on the linear function...

work page 2014
[11]

Upward" Exploration

xt = arg maxxt,i,i∈[n] ( f1(xt,i; θ1) +UCBt,i ) . EE-Net (Our approach) ∀i∈ [n], compute f1(xt,i; θ1), f2 ( ∇θ1f1(xt,i; θ1); θ2) (Exploration Net). Then xt = arg maxxt,ii∈[n]f3(f1,f 2; θ3). In this section, we list one gradient-based UCB from existing works (Ban et al., 2021; Zhou et al., 2020), which motivates our design of exploration networkf2. Lemma A...

work page 2021
[12]

,$;𝜃!) ℎ(𝑥

When the estimated reward is larger than the expected reward, i.e.,h(x)−f1(x) < 0, we need to do the ‘downward exploration’, i.e., lowering the exploration score ofx to reduce its chance of being explored; 18 𝑓!(𝑥",$;𝜃!) ℎ(𝑥",$) Gap 𝑓!(𝑥",$;𝜃!) ℎ(𝑥",$) Gap Case 1: Upward ExplorationCase 2: Downward Exploration Figure 4: Two types of exploration: Upward ex...

work page 2019
[13]

(2),∥∇θf(xt,i; θ)∥2≤O ( √ L)

With probability at least 1−O (TnL 2)· exp(−Ω(mω2/3L)) over the random initialization, for allt∈ [T ],i∈ [n], θ satisfying∥θ− θ0∥2≤ω with ω≤O (L−9/2[logm]−3), it holds uniformly that (1),|f(xt,i; θ)|≤O (1). (2),∥∇θf(xt,i; θ)∥2≤O ( √ L). (3),∥∇θLt(θ)∥2≤O ( √ L) Proof. (1) is a simply application of Cauchy–Schwarz inequality. |f(xt,i; θ)| =|WL( L−1∏ l=1 DlW...

work page 2019
[14]

With probability at least 1−O (TnL 2)· exp(−Ω(mω2/3L)) over the random initialization, for all∥x∥2 = 1, θ satisfying∥θ− θ0∥2,∥θ′− θ0∥2≤ω with ω≤O (L−9/2[logm]−3), it holds uniformly that |f(x; θ)−f(x; θ′)−⟨∇ θ′f(x; θ′), θ− θ′⟩|≤O (w1/3L2√ logm)∥θ− θ′∥2. Proof. Based on Lemma 4.1 (Cao and Gu, 2019), it holds uniformly that |√mf(x; θ)−√mf(x; θ′)−⟨√m∇θ′f(x; ...

work page 2019

[1] [1]

Convolutional neural bandit: Provable algorithm for visual-aware advertising

Yikun Ban and Jingrui He. Convolutional neural bandit: Provable algorithm for visual-aware advertising. arXiv preprint arXiv:2107.07438, 2021a. Yikun Ban and Jingrui He. Local clustering in contextual multi-armed bandits. InProceedings of the Web Conference 2021, pages 2335–2346, 2021b. Yikun Ban, Jingrui He, and Curtiss B Cook. Multi-facet contextual ban...

work page arXiv 2021

[2] [2]

Provable regret bounds for deep online learning and control.arXiv preprint arXiv:2110.07807,

Xinyi Chen, Edgar Minasyan, Jason D Lee, and Elad Hazan. Provable regret bounds for deep online learning and control.arXiv preprint arXiv:2110.07807,

work page arXiv

[3] [3]

SDG: A simpliﬁed and dynamic graph neural network

Dongqi Fu and Jingrui He. SDG: A simpliﬁed and dynamic graph neural network. InSIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 2273–2277. ACM,

work page 2021

[4] [4]

Convolutional networks for images, speech, and time series

Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995,

work page 1995

[5] [5]

Provably optimal algorithms for generalized linear contextual bandits

Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. InInternational Conference on Machine Learning, pages 2071–2080. PMLR,

work page 2071

[6] [6]

Ensemble sampling.arXiv preprint arXiv:1705.07347,

15 Xiuyuan Lu and Benjamin Van Roy. Ensemble sampling.arXiv preprint arXiv:1705.07347,

work page arXiv

[7] [7]

Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling

Carlos Riquelme, George Tucker, and Jasper Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling.arXiv preprint arXiv:1802.09127,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Finite-Time Analysis of Kernelised Contextual Bandits

Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits.arXiv preprint arXiv:1309.6869,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Neural contextual bandits with deep representation and shallow exploration.arXiv preprint arXiv:2012.01780,

Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration.arXiv preprint arXiv:2012.01780,

work page arXiv 2012

[10] [10]

Further Discussion A.1 Independence of Input or Eﬀective Dimension This property comes from the new analysis methodology of EE-Net

16 A. Further Discussion A.1 Independence of Input or Eﬀective Dimension This property comes from the new analysis methodology of EE-Net. As the linear bandits (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019;) work in the Euclidean space and build the conﬁdence ellipsoid forθ∗ (optimal parameters) based on the linear function...

work page 2014

[11] [11]

Upward" Exploration

xt = arg maxxt,i,i∈[n] ( f1(xt,i; θ1) +UCBt,i ) . EE-Net (Our approach) ∀i∈ [n], compute f1(xt,i; θ1), f2 ( ∇θ1f1(xt,i; θ1); θ2) (Exploration Net). Then xt = arg maxxt,ii∈[n]f3(f1,f 2; θ3). In this section, we list one gradient-based UCB from existing works (Ban et al., 2021; Zhou et al., 2020), which motivates our design of exploration networkf2. Lemma A...

work page 2021

[12] [12]

,$;𝜃!) ℎ(𝑥

When the estimated reward is larger than the expected reward, i.e.,h(x)−f1(x) < 0, we need to do the ‘downward exploration’, i.e., lowering the exploration score ofx to reduce its chance of being explored; 18 𝑓!(𝑥",$;𝜃!) ℎ(𝑥",$) Gap 𝑓!(𝑥",$;𝜃!) ℎ(𝑥",$) Gap Case 1: Upward ExplorationCase 2: Downward Exploration Figure 4: Two types of exploration: Upward ex...

work page 2019

[13] [13]

(2),∥∇θf(xt,i; θ)∥2≤O ( √ L)

With probability at least 1−O (TnL 2)· exp(−Ω(mω2/3L)) over the random initialization, for allt∈ [T ],i∈ [n], θ satisfying∥θ− θ0∥2≤ω with ω≤O (L−9/2[logm]−3), it holds uniformly that (1),|f(xt,i; θ)|≤O (1). (2),∥∇θf(xt,i; θ)∥2≤O ( √ L). (3),∥∇θLt(θ)∥2≤O ( √ L) Proof. (1) is a simply application of Cauchy–Schwarz inequality. |f(xt,i; θ)| =|WL( L−1∏ l=1 DlW...

work page 2019

[14] [14]

With probability at least 1−O (TnL 2)· exp(−Ω(mω2/3L)) over the random initialization, for all∥x∥2 = 1, θ satisfying∥θ− θ0∥2,∥θ′− θ0∥2≤ω with ω≤O (L−9/2[logm]−3), it holds uniformly that |f(x; θ)−f(x; θ′)−⟨∇ θ′f(x; θ′), θ− θ′⟩|≤O (w1/3L2√ logm)∥θ− θ′∥2. Proof. Based on Lemma 4.1 (Cao and Gu, 2019), it holds uniformly that |√mf(x; θ)−√mf(x; θ′)−⟨√m∇θ′f(x; ...

work page 2019