Two-Action Apple Tasting with Switching Costs

Roberto Colomboni; Tommaso Cesari

arxiv: 2606.03851 · v1 · pith:OCWVE4X6new · submitted 2026-06-02 · 💻 cs.LG

Two-Action Apple Tasting with Switching Costs

Tommaso Cesari , Roberto Colomboni This is my paper

Pith reviewed 2026-06-28 11:31 UTC · model grok-4.3

classification 💻 cs.LG

keywords apple tastingswitching costsregret boundsoblivious adversaryfeedback graphsonline learningminimax regret

0 comments

The pith

In two-action apple tasting with switching costs the minimax regret against an oblivious adversary is Theta of sqrt(T).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes matching upper and lower bounds showing that the oblivious minimax expected regret for the two-action apple-tasting problem with unit switching costs scales as a constant times sqrt(T). This matters because general algorithms for feedback graphs with switching costs only guarantee T to the 2/3, and this setting had been viewed as a candidate for proving a matching lower bound that would apply to many other graphs. By removing that obstruction the result indicates that the T to the 2/3 barrier is not forced by every two-action partial-feedback structure that includes switches. A reader would care because it narrows the search for which feedback graphs truly require the slower rate.

Core claim

In the normalized formulation the learner repeatedly chooses either the revealing action (reward 0, reveals x_t in [-1,1]) or the blind action (reward x_t, reveals nothing) and pays one unit per switch; regret is measured against the single best fixed action in hindsight. The paper proves that the oblivious minimax expected regret R_T^* satisfies 1/(2 sqrt(3)) sqrt(T) less than or equal to R_T^* less than or equal to 2 sqrt(3) sqrt(T).

What carries the argument

The two-action apple-tasting feedback graph with unit switching costs, analyzed via direct construction of a strategy and a matching lower-bound argument for the oblivious adversary.

If this is right

The two-action apple-tasting graph is not the source of an Omega(T^{2/3}) lower bound for feedback graphs with switching costs.
Specialized algorithms for this exact setting achieve O(sqrt(T)) regret, improving on the general T^{2/3} guarantee.
Any complete classification of feedback graphs under switching costs must locate its hard instances in other graph structures.
The gap between oblivious and adaptive adversaries remains open for this problem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result suggests that certain two-action partial-feedback problems may escape the switching-cost penalty entirely, inviting checks on nearby graphs that differ only in reward revelation rules.
It raises the question whether the same sqrt(T) rate holds when the adversary is allowed to adapt, which the paper leaves unresolved.
A natural extension is to ask whether adding more than two actions restores the T^{2/3} barrier or keeps the rate at sqrt(T).

Load-bearing premise

The adversary commits to the entire sequence of hidden values x_t in advance and never adapts to the learner's observed choices.

What would settle it

A concrete sequence of x_t values (or a family of sequences) for which the expected regret of any learner exceeds the stated upper bound constant times sqrt(T) for sufficiently large T, or falls below the lower-bound constant.

Figures

Figures reproduced from arXiv: 2606.03851 by Roberto Colomboni, Tommaso Cesari.

read the original abstract

We study the two-action apple-tasting problem with switching costs against an oblivious adversary. In an equivalent normalized formulation, at each round the learner chooses between a revealing action and a blind action: the revealing action gives reward $0$ and reveals the hidden value $x_t\in[-1,1]$ of the blind action; the blind action gives reward $x_t$ but reveals nothing. The learner pays one unit whenever they switches actions, and regret is measured against the best fixed action in hindsight. General feedback-graph algorithms with switching costs give $\widetilde O(T^{2/3})$ regret guarantees for this problem. The two-action apple-tasting graph was the natural candidate for the missing $\Omega(T^{2/3})$ obstruction in the switching-cost classification: such a lower bound would have transferred to a large family of still-unclassified feedback graphs. We prove that this obstruction is not there: the oblivious minimax expected regret for this problem satisfies \[ \frac{1}{2\sqrt3}\cdot\sqrt T \le R_T^\star \le 2\sqrt{3}\cdot \sqrt{T}. \]

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper shows the two-action apple tasting problem with switching costs has matching Θ(√T) minimax regret bounds with explicit constants, so it does not force the T^{2/3} rate.

read the letter

The main thing to know is that this specific two-action feedback graph achieves √T regret against an oblivious adversary, with the first explicit constants: lower bound 1/(2√3) √T and upper bound 2√3 √T. That rules out using this graph as the missing Ω(T^{2/3}) example in the switching-costs classification.

The paper derives these bounds directly in the normalized formulation. The upper bound comes from an algorithm analyzed with a potential function that accounts for the unit switching cost. The lower bound uses a distributional construction and Yao's principle, measuring regret against the best fixed action. Both are self-contained for this instance and correctly incorporate the reveal-vs-blind tradeoff.

The work is solid on its own terms. The rate result holds up, and the constants are stated clearly. A minor soft spot is the factor-of-12 gap between the constants; that is typical for first analyses but means the bounds are not yet tight to each other. The oblivious-adversary setting is standard and matches what the claim requires, so no overreach there.

This is for people working on online learning with feedback graphs and switching costs. A reader who cares about classifying which graphs get √T versus worse rates will get direct value. It deserves a serious referee because it closes a concrete open case with matching bounds rather than just another general upper bound.

Referee Report

0 major / 2 minor

Summary. The paper studies the two-action apple-tasting problem with switching costs against an oblivious adversary. In the normalized formulation, the learner chooses at each round between a revealing action (reward 0, reveals x_t ∈ [-1,1]) and a blind action (reward x_t, no revelation), incurring a unit cost on switches; regret is measured against the best fixed action in hindsight. The central claim is that the oblivious minimax expected regret satisfies 1/(2√3)·√T ≤ R_T^* ≤ 2√3·√T, establishing matching Θ(√T) bounds via an explicit algorithm for the upper bound and a distributional construction for the lower bound.

Significance. If the result holds, it is significant for the classification of feedback graphs with switching costs: it shows that the two-action apple-tasting graph does not furnish the anticipated Ω(T^{2/3}) obstruction, as general feedback-graph algorithms only guarantee Õ(T^{2/3}). The paper supplies self-contained, matching bounds that correctly incorporate the unit switching cost and measure regret against the best fixed action, using a potential-function analysis for the upper bound and the Yao principle for the lower bound.

minor comments (2)

[§3] §3, Algorithm 1: the potential function Φ_t is defined with a specific scaling constant; a brief remark on how this constant is chosen to balance the switching penalty and the estimation error would improve readability.
[§4] §4, Lemma 4.2: the distributional construction for the lower bound uses a specific bias parameter; clarifying why this bias yields exactly the 1/(2√3) factor (rather than a looser constant) would strengthen the presentation.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and their recommendation to accept. We appreciate the recognition of the result's significance for the classification of feedback graphs with switching costs.

Circularity Check

0 steps flagged

No significant circularity; bounds derived independently via algorithm and Yao construction

full rationale

The central claim establishes matching Θ(√T) oblivious minimax regret bounds. The upper bound follows from an explicit algorithm analyzed via potential functions in the normalized two-action setting; the lower bound follows from an explicit distributional construction over sequences and application of Yao's principle. Neither reduces to a fitted parameter, self-definition, or load-bearing self-citation. The analysis incorporates the unit switching cost and measures regret against the best fixed action in hindsight, remaining self-contained against the problem formulation without external unverified premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard definitions of minimax regret in online learning against oblivious adversaries and a specialized analysis for the two-action feedback graph; no free parameters or invented entities are introduced.

axioms (1)

standard math Standard definition of minimax expected regret R_T^* against an oblivious adversary in the normalized revealing/blind action setup
The result is expressed directly in terms of this quantity as stated in the abstract.

pith-pipeline@v0.9.1-grok · 5732 in / 1186 out tokens · 25954 ms · 2026-06-28T11:31:16.484104+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references

[1]

Online learning with feedback graphs: Beyond bandits

Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. InConference on Learning Theory, pages 23–35. PMLR, 2015

2015
[2]

Bandits with feedback graphs and switching costs.Advances in Neural Information Processing Systems, 32, 2019

Raman Arora, Teodor Vanislavov Marinov, and Mehryar Mohri. Bandits with feedback graphs and switching costs.Advances in Neural Information Processing Systems, 32, 2019

2019
[3]

Regret minimization under partial monitoring.Mathematics of Operations Research, 31(3):562–580, 2006

Nicol` o Cesa-Bianchi, G´ abor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring.Mathematics of Operations Research, 31(3):562–580, 2006

2006
[4]

Bandits with switching costs: T 2/3 regret

Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: T 2/3 regret. InProceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 459–467, 2014. 18

2014
[5]

Regret minimization for online buffering problems using the weighted majority algorithm

Sascha Geulen, Berthold V¨ ocking, and Melanie Winkler. Regret minimization for online buffering problems using the weighted majority algorithm. InProceedings of the 23rd Annual Conference on Learning Theory, pages 132–143, 2010

2010
[6]

Helmbold, Nick Littlestone, and Philip M

David P. Helmbold, Nick Littlestone, and Philip M. Long. Apple tasting and nearly one-sided learning. InProceedings of the 33rd Annual Symposium on Foundations of Computer Science, pages 493–502. IEEE Computer Society, 1992

1992
[7]

Apple tasting.Information and Computation, 161(2):85–139, 2000

David P Helmbold, Nicholas Littlestone, and Philip M Long. Apple tasting.Information and Computation, 161(2):85–139, 2000

2000
[8]

Efficient algorithms for online decision problems

Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005

2005
[9]

From bandits to experts: On the value of side-observations

Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. Advances in neural information processing systems, 24, 2011

2011
[10]

Apple tasting: Combinatorial dimensions and minimax rates

Vinod Raman, Unique Subedi, Ananth Raman, and Ambuj Tewari. Apple tasting: Combinatorial dimensions and minimax rates. InThe Thirty Seventh Annual Conference on Learning Theory, pages 4358–4380. PMLR, 2024. 19

2024

[1] [1]

Online learning with feedback graphs: Beyond bandits

Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. InConference on Learning Theory, pages 23–35. PMLR, 2015

2015

[2] [2]

Bandits with feedback graphs and switching costs.Advances in Neural Information Processing Systems, 32, 2019

Raman Arora, Teodor Vanislavov Marinov, and Mehryar Mohri. Bandits with feedback graphs and switching costs.Advances in Neural Information Processing Systems, 32, 2019

2019

[3] [3]

Regret minimization under partial monitoring.Mathematics of Operations Research, 31(3):562–580, 2006

Nicol` o Cesa-Bianchi, G´ abor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring.Mathematics of Operations Research, 31(3):562–580, 2006

2006

[4] [4]

Bandits with switching costs: T 2/3 regret

Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: T 2/3 regret. InProceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 459–467, 2014. 18

2014

[5] [5]

Regret minimization for online buffering problems using the weighted majority algorithm

Sascha Geulen, Berthold V¨ ocking, and Melanie Winkler. Regret minimization for online buffering problems using the weighted majority algorithm. InProceedings of the 23rd Annual Conference on Learning Theory, pages 132–143, 2010

2010

[6] [6]

Helmbold, Nick Littlestone, and Philip M

David P. Helmbold, Nick Littlestone, and Philip M. Long. Apple tasting and nearly one-sided learning. InProceedings of the 33rd Annual Symposium on Foundations of Computer Science, pages 493–502. IEEE Computer Society, 1992

1992

[7] [7]

Apple tasting.Information and Computation, 161(2):85–139, 2000

David P Helmbold, Nicholas Littlestone, and Philip M Long. Apple tasting.Information and Computation, 161(2):85–139, 2000

2000

[8] [8]

Efficient algorithms for online decision problems

Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005

2005

[9] [9]

From bandits to experts: On the value of side-observations

Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. Advances in neural information processing systems, 24, 2011

2011

[10] [10]

Apple tasting: Combinatorial dimensions and minimax rates

Vinod Raman, Unique Subedi, Ananth Raman, and Ambuj Tewari. Apple tasting: Combinatorial dimensions and minimax rates. InThe Thirty Seventh Annual Conference on Learning Theory, pages 4358–4380. PMLR, 2024. 19

2024