Stochastic One-Sided Full-Information Bandit

Haoyu Zhao; Wei Chen

arxiv: 1906.08656 · v1 · pith:SSKVVR7Snew · submitted 2019-06-20 · 💻 cs.LG · cs.DS· stat.ML

Stochastic One-Sided Full-Information Bandit

Haoyu Zhao , Wei Chen This is my paper

Pith reviewed 2026-05-25 19:36 UTC · model grok-4.3

classification 💻 cs.LG cs.DSstat.ML

keywords stochastic banditone-sided full informationelimination algorithmregret boundsdistribution independentdistribution dependentsecond price auction

0 comments

The pith

An elimination-based algorithm achieves the best-known regret bounds for the stochastic one-sided full-information bandit problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the stochastic one-sided full-information bandit, in which selecting an arm yields its reward and the rewards of all arms with higher indices. It introduces an elimination-based algorithm designed to efficiently discard suboptimal arms using the one-sided feedback. This algorithm delivers a distribution-independent regret bound of O(sqrt(T log(TK))) and a distribution-dependent bound of O((log T + log K) f(Δ)), where Δ represents the gaps to the best arm. These are presented as the tightest theoretical upper bounds available for this problem. The setting applies directly to repeated second-price auctions in which the auctioneer sets reserve prices and observes bids only when they exceed the reserve.

Core claim

The central discovery is that an elimination based algorithm achieves distribution independent regret upper bound O(√(T·log (TK))), and distribution dependent bound O((log T + log K)f(Δ)), and has the best theoretical regret upper bound so far for the stochastic one-sided full-information bandit.

What carries the argument

Elimination-based algorithm that uses one-sided reward observations to generate unbiased estimates for all arms and eliminates suboptimal ones.

If this is right

The regret bounds apply to the modeling of online repeated second-price auctions.
Suboptimal arms can be removed efficiently leading to the logarithmic dependence on T and K in the gap-dependent bound.
The algorithm improves upon prior theoretical guarantees for this bandit variant.
Empirical tests confirm better performance compared to alternative approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique of using one-sided feedback for unbiased estimates could apply to other auction or ranking-based decision problems.
Analyzing the specific form of f(Δ) might reveal how the bound behaves under particular gap structures.
The bounds suggest potential for low-regret reserve price selection in repeated auctions with many bidders.

Load-bearing premise

The one-sided observations from playing arm i suffice to produce unbiased estimates of all relevant arm means, allowing the elimination procedure to remove suboptimal arms at the rate needed for the stated regret bounds to hold.

What would settle it

A simulation or theoretical construction where the algorithm's regret grows faster than O(sqrt(T log(TK))) as T increases, under arbitrary reward distributions, would falsify the distribution-independent bound.

Figures

Figures reproduced from arXiv: 1906.08656 by Haoyu Zhao, Wei Chen.

**Figure 2.** Figure 2: Uniform-mean suboptimal arms with K = 20, ∆ = 0.1, and varying Best. introduced in [7], which solves one-sided full information bandit in the adversarial case, and (b) UCB-N algorithm introduced in [6] to solve stochastic multiarmed bandit with side information, and it is essentially UCB but updates any arm when it has an observation, not just the arm played in the round. First, we do experiments when al… view at source ↗

**Figure 3.** Figure 3: Random-mean suboptimal arms with K = 20, Best = 17, and varying ∆. (a) Best = 10 (b) Best = 3 [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: Uniform-mean suboptimal arms with K = 20, ∆ = 0.1, and varying Best. 2. In the first experiments, when we change the position of the best arm, the regret line does not change so much. In the second experiments where we add more randomness, if the position of the best arm has small index, then our algorithm will perform better. However, the existing EXP3-RTB algorithm does not have this property. 3. UCB-N c… view at source ↗

read the original abstract

In this paper, we study the stochastic version of the one-sided full information bandit problem, where we have $K$ arms $[K] = \{1, 2, \ldots, K\}$, and playing arm $i$ would gain reward from an unknown distribution for arm $i$ while obtaining reward feedback for all arms $j \ge i$. One-sided full information bandit can model the online repeated second-price auctions, where the auctioneer could select the reserved price in each round and the bidders only reveal their bids when their bids are higher than the reserved price. In this paper, we present an elimination-based algorithm to solve the problem. Our elimination based algorithm achieves distribution independent regret upper bound $O(\sqrt{T\cdot\log (TK)})$, and distribution dependent bound $O((\log T + \log K)f(\Delta))$, where $T$ is the time horizon, $\Delta$ is a vector of gaps between the mean reward of arms and the mean reward of the best arm, and $f(\Delta)$ is a formula depending on the gap vector that we will specify in detail. Our algorithm has the best theoretical regret upper bound so far. We also validate our algorithm empirically against other possible alternatives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a new elimination algorithm for the stochastic one-sided full-information bandit with O(sqrt(T log(TK))) distribution-free regret and a gap-dependent bound, presented as the best so far for this auction-motivated setting.

read the letter

The main takeaway is that this paper introduces an elimination algorithm tailored to the stochastic one-sided full-information bandit, where playing arm i yields rewards for all j >= i. It claims an O(sqrt(T log(TK))) distribution-independent regret bound and an O((log T + log K) f(Delta)) distribution-dependent bound, along with some empirical checks against alternatives. The setting directly models repeated second-price auctions with reserve prices as the arm choices.

Referee Report

0 major / 3 minor

Summary. The paper studies the stochastic one-sided full-information bandit problem with K arms, where playing arm i yields a reward sample from arm i's unknown distribution together with reward observations for all arms j ≥ i. It proposes an elimination-based algorithm and claims it attains a distribution-independent regret bound of O(√(T log(TK))) and a distribution-dependent bound of O((log T + log K) f(Δ)), where f(Δ) is a gap-dependent function specified in the paper; the algorithm is asserted to have the best known theoretical regret upper bound and is supported by empirical comparisons.

Significance. If the stated regret bounds hold under the one-sided feedback model, the work supplies improved theoretical guarantees for a structured bandit setting that directly models repeated second-price auctions. The elimination procedure and the explicit rates constitute a technical contribution that extends classical stochastic bandit analysis to partial-observation regimes, with potential implications for online mechanism design.

minor comments (3)

[Abstract] Abstract: the function f(Δ) is referenced but not defined or sketched; a one-sentence description or pointer to the precise expression (presumably in §4 or §5) would improve readability.
[Experiments] The empirical section should report the precise number of independent runs, the range of T and K tested, and the exact baselines (including any prior one-sided algorithms) to allow direct replication of the claimed superiority.
[Preliminaries] Notation: the definition of the gap vector Δ and the precise form of the elimination threshold should be stated once in a dedicated preliminary section rather than introduced inline in the algorithm description.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the recognition of its significance for structured bandit problems and online mechanism design, and the recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No circularity: derivation relies on standard concentration and elimination

full rationale

The paper presents an elimination algorithm for the one-sided full-information bandit and derives distribution-independent regret O(√(T log(TK))) and distribution-dependent O((log T + log K) f(Δ)) bounds. These follow from unbiased estimates obtained directly from one-sided observations and standard concentration inequalities applied to empirical means; elimination thresholds are set accordingly. No step reduces a claimed prediction to a fitted parameter by construction, no self-citation is load-bearing for the central bounds, and the modeling assumptions (unbiasedness from one-sided feedback) are external to the regret derivation itself. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities; standard stochastic bandit assumptions are likely implicit but unspecified.

pith-pipeline@v0.9.0 · 5745 in / 957 out tokens · 30415 ms · 2026-05-25T19:36:40.121146+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

In: Advances in Neural Information Process- ing Systems

Alon, N., Cesa-Bianchi, N., Gentile, C., Mansour, Y.: From bandits to experts: A tale of domination and independence. In: Advances in Neural Information Process- ing Systems. pp. 1610–1618 (2013)

work page 2013
[2]

Machine Learning 47(2-3), 235–256 (2002)

Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), 235–256 (2002)

work page 2002
[3]

Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multi- armed bandit problem. SIAM J. Comput. 32(1), 48–77 (2002)

work page 2002
[4]

Chapman and Hall (1985)

Berry, D.A., Fristedt, B.: Bandit problems: Sequential Allocation of Experiments. Chapman and Hall (1985)

work page 1985
[5]

Foundations and Trends in Machine Learning 5(1), 1–122 (2012)

Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi- armed bandit problems. Foundations and Trends in Machine Learning 5(1), 1–122 (2012)

work page 2012
[6]

Caron, S., Kveton, B., Lelarge, M., Bhagat, S.: Leveraging side observations in stochastic bandits. In: UAI. pp. 142–151 (2012)

work page 2012
[7]

In: Proceedings of the 30th Conference on Learning Theory

Cesa-Bianchi, N., Gaillard, P., Gentile, C., Gerchinovitz, S.: Algorithmic chaining and the role of partial feedback in online nonparametric learning. In: Proceedings of the 30th Conference on Learning Theory. pp. 465–481 (2017)

work page 2017
[8]

In: Advances in Neural Information Processing Systems

Mannor, S., Shamir, O.: From bandits to experts: On the value of side-observations. In: Advances in Neural Information Processing Systems. pp. 684–692 (2011)

work page 2011
[9]

Bulletin Amer- ican Mathematical Society 55, 527–535 (1952) Stochastic One-Sided Full-Information Bandit 17

Robbins, H.: Some aspects of the sequential design of experiments. Bulletin Amer- ican Mathematical Society 55, 527–535 (1952) Stochastic One-Sided Full-Information Bandit 17

work page 1952
[10]

The American mathematical monthly 62(1), 26–29 (1955)

Robbins, H.: A remark on stirling’s formula. The American mathematical monthly 62(1), 26–29 (1955)

work page 1955
[11]

Biometrika 25(3/4), 285–294 (1933)

Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4), 285–294 (1933)

work page 1933
[12]

Young, N.: Reverse chernoﬀ bound. Theoretical Computer Science Stack Exchange, https://cstheory.stackexchange.com/q/14476 18 Haoyu Zhao and Wei Chen Supplementary Material This part contains the missing proofs in the main text. For convenience, we restate the theorems and lemmas here. A Missing Proof in Section 3.2 A.1 Proof of Lemma 1 Lemma 1. For each r...

work page
[13]

Then choose ε = √ c lnK/T , and sum over t which satisﬁes c lnK 2ε2 ≤ t ≤ c lnK ε2 , we complete the proof

+ Pr{j /∈St}µ∗) =µ∗ − 1 K K∑ j=1 I{j ∈St}ε 8 ≤µ∗ − ε 24. Then choose ε = √ c lnK/T , and sum over t which satisﬁes c lnK 2ε2 ≤ t ≤ c lnK ε2 , we complete the proof. ⊓ ⊔

work page

[1] [1]

In: Advances in Neural Information Process- ing Systems

Alon, N., Cesa-Bianchi, N., Gentile, C., Mansour, Y.: From bandits to experts: A tale of domination and independence. In: Advances in Neural Information Process- ing Systems. pp. 1610–1618 (2013)

work page 2013

[2] [2]

Machine Learning 47(2-3), 235–256 (2002)

Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), 235–256 (2002)

work page 2002

[3] [3]

Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multi- armed bandit problem. SIAM J. Comput. 32(1), 48–77 (2002)

work page 2002

[4] [4]

Chapman and Hall (1985)

Berry, D.A., Fristedt, B.: Bandit problems: Sequential Allocation of Experiments. Chapman and Hall (1985)

work page 1985

[5] [5]

Foundations and Trends in Machine Learning 5(1), 1–122 (2012)

Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi- armed bandit problems. Foundations and Trends in Machine Learning 5(1), 1–122 (2012)

work page 2012

[6] [6]

Caron, S., Kveton, B., Lelarge, M., Bhagat, S.: Leveraging side observations in stochastic bandits. In: UAI. pp. 142–151 (2012)

work page 2012

[7] [7]

In: Proceedings of the 30th Conference on Learning Theory

Cesa-Bianchi, N., Gaillard, P., Gentile, C., Gerchinovitz, S.: Algorithmic chaining and the role of partial feedback in online nonparametric learning. In: Proceedings of the 30th Conference on Learning Theory. pp. 465–481 (2017)

work page 2017

[8] [8]

In: Advances in Neural Information Processing Systems

Mannor, S., Shamir, O.: From bandits to experts: On the value of side-observations. In: Advances in Neural Information Processing Systems. pp. 684–692 (2011)

work page 2011

[9] [9]

Bulletin Amer- ican Mathematical Society 55, 527–535 (1952) Stochastic One-Sided Full-Information Bandit 17

Robbins, H.: Some aspects of the sequential design of experiments. Bulletin Amer- ican Mathematical Society 55, 527–535 (1952) Stochastic One-Sided Full-Information Bandit 17

work page 1952

[10] [10]

The American mathematical monthly 62(1), 26–29 (1955)

Robbins, H.: A remark on stirling’s formula. The American mathematical monthly 62(1), 26–29 (1955)

work page 1955

[11] [11]

Biometrika 25(3/4), 285–294 (1933)

Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4), 285–294 (1933)

work page 1933

[12] [12]

Young, N.: Reverse chernoﬀ bound. Theoretical Computer Science Stack Exchange, https://cstheory.stackexchange.com/q/14476 18 Haoyu Zhao and Wei Chen Supplementary Material This part contains the missing proofs in the main text. For convenience, we restate the theorems and lemmas here. A Missing Proof in Section 3.2 A.1 Proof of Lemma 1 Lemma 1. For each r...

work page

[13] [13]

Then choose ε = √ c lnK/T , and sum over t which satisﬁes c lnK 2ε2 ≤ t ≤ c lnK ε2 , we complete the proof

+ Pr{j /∈St}µ∗) =µ∗ − 1 K K∑ j=1 I{j ∈St}ε 8 ≤µ∗ − ε 24. Then choose ε = √ c lnK/T , and sum over t which satisﬁes c lnK 2ε2 ≤ t ≤ c lnK ε2 , we complete the proof. ⊓ ⊔

work page