Nearly-Optimal Bandit Learning in Stackelberg Games with Side Information

Andrea Celli; Keegan Harris; Maria-Florina Balcan; Martino Bernasconi; Matteo Castiglioni; Zhiwei Steven Wu

arxiv: 2502.00204 · v3 · submitted 2025-01-31 · 💻 cs.LG · cs.GT

Nearly-Optimal Bandit Learning in Stackelberg Games with Side Information

Maria-Florina Balcan , Martino Bernasconi , Matteo Castiglioni , Andrea Celli , Keegan Harris , Zhiwei Steven Wu This is my paper

Pith reviewed 2026-05-23 03:33 UTC · model grok-4.3

classification 💻 cs.LG cs.GT

keywords Stackelberg gamescontextual banditsonline learningregret minimizationside informationBayesian persuasionauction bidding

0 comments

The pith

Stackelberg leader with side information achieves O(sqrt(T)) regret by reducing to linear contextual bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines repeated Stackelberg games in which the leader receives contextual side information each round, commits to a mixed strategy, and then faces a best-responding follower. Its goal is to bound the leader's regret relative to the best fixed strategy in hindsight under bandit feedback. Prior algorithms attained only O(T^{2/3}) regret, but the new method maps the problem to linear contextual bandit learning over utility vectors. A bandit algorithm selects a utility vector each round; this vector is inverted into a valid leader mixed strategy. The resulting O(T^{1/2}) bound is nearly optimal and carries over to unknown leader utilities as well as to bidding in second-price auctions and online Bayesian persuasion.

Core claim

The paper establishes that Stackelberg learning with contextual side information reduces to linear contextual bandit learning in utility space. Each round a contextual bandit recommends a utility vector that the algorithm inverts into a mixed strategy for the leader, yielding O(T^{1/2}) regret under bandit feedback. The same reduction handles unknown leader utilities and yields algorithms for bidding in second-price auctions and online Bayesian persuasion.

What carries the argument

Reduction to linear contextual bandits in utility space, where each recommended utility vector is inverted to a valid leader mixed strategy.

If this is right

The approach extends to the case where the leader's own utility function is unknown.
It directly yields algorithms for bidding in second-price auctions with side information.
It applies to online Bayesian persuasion with both public and private states.
Numerical simulations show the algorithms outperform prior methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Utility-space reductions may simplify learning in other games whose equilibria are defined by best responses.
If the inversion map can be computed efficiently, the method could support real-time deployment in mechanism-design settings.
The near-optimal rate suggests contextual Stackelberg problems are information-theoretically no harder than standard linear bandits.

Load-bearing premise

Any utility vector recommended by the linear contextual bandit algorithm can be inverted to a valid mixed strategy for the leader without adding extra regret or violating the best-response structure.

What would settle it

An instance in which the inversion step from a recommended utility vector either yields an invalid probability distribution or produces total regret exceeding the linear-bandit bound would falsify the claimed reduction.

Figures

Figures reproduced from arXiv: 2502.00204 by Andrea Celli, Keegan Harris, Maria-Florina Balcan, Martino Bernasconi, Matteo Castiglioni, Zhiwei Steven Wu.

**Figure 1.** Figure 1: Empirical Results Corollary 4.5. When Ut := {u(zt , µ) : µ ∈ Et} and R is instantiated as the OFUL algorithm of Abbasi-Yadkori et al. [1], the expected regret of Algorithm 1 is E[R(T)] = O(K √ T log(T)) when the sequence of public states is chosen adversarially and the sequence of receiver types is chosen stochastically. Corollary 4.6. When Ut := {u(zt , µ) : µ ∈ Et} and R is instantiated as the regret min… view at source ↗

read the original abstract

We study the problem of online learning in Stackelberg games with side information between a leader and a sequence of followers. In every round the leader observes contextual information and commits to a mixed strategy, after which the follower best-responds. We provide learning algorithms for the leader which achieve $O(T^{1/2})$ regret under bandit feedback, an improvement from the previously best-known rates of $O(T^{2/3})$. Our algorithms rely on a reduction to linear contextual bandits in the utility space: In each round, a linear contextual bandit algorithm recommends a utility vector, which our algorithm inverts to determine the leader's mixed strategy. We extend our algorithms to the setting in which the leader's utility function is unknown, and also apply it to the problems of bidding in second-price auctions with side information and online Bayesian persuasion with public and private states. Finally, we observe that our algorithms empirically outperform previous results on numerical simulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The sqrt(T) regret via reduction to linear contextual bandits in utility space is the main advance, but the inversion from utility vector to mixed strategy needs verification to confirm the bound transfers cleanly.

read the letter

The paper improves the leader's regret in contextual Stackelberg games from the prior T^{2/3} to O(sqrt(T)) under bandit feedback. The technical step is mapping the problem to linear contextual bandits by having the algorithm pick a utility vector each round and then inverting it to a mixed strategy for the leader. This reduction is presented as new and is what enables the better rate. They also handle the case where the leader's own utility is unknown and carry the same approach over to bidding in second-price auctions with side information and to online Bayesian persuasion with public or private states. The simulations show the method outperforming earlier algorithms on the tested instances. The potential weak point is exactly the inversion: the regret transfer assumes that any utility vector recommended by the linear bandit can be turned back into a valid leader strategy without extra regret or without needing knowledge of the follower's private information. If the mapping is not always bijective or requires solving an optimization that depends on unknown parameters, the sqrt(T) bound could pick up lower-order terms that the abstract does not discuss. The paper treats the linear bandit subroutine as a black box whose guarantees carry over directly. This is aimed at people working on regret bounds for online learning in games and mechanism design. A reader who cares about strategic agent learning or auction design would find the reduction and the applications useful. The improvement is concrete enough to merit a serious referee even if the proofs require careful checking on the inversion step.

Referee Report

2 major / 1 minor

Summary. The manuscript studies online learning in Stackelberg games with contextual side information. The leader observes context each round, commits to a mixed strategy, and the follower best-responds; the goal is sublinear regret for the leader under bandit feedback. The central contribution is a reduction to linear contextual bandits in utility space: a standard linear bandit algorithm recommends a utility vector that is inverted to produce the leader's mixed strategy, yielding O(T^{1/2}) regret and improving on prior O(T^{2/3}) rates. The approach is extended to unknown leader utilities and applied to second-price auctions with side information and online Bayesian persuasion (public and private states), with supporting numerical simulations.

Significance. If the reduction is shown to transfer regret bounds exactly, the result would be significant: it supplies the first nearly-optimal (sqrt(T)) regret algorithms for contextual Stackelberg learning under bandit feedback and offers a reusable reduction technique for related mechanism-design problems. The empirical comparisons and extensions strengthen the practical contribution.

major comments (2)

[Abstract] Abstract (reduction paragraph): The O(sqrt(T)) claim rests on the inversion map from the linear contextual bandit’s recommended utility vector u to a valid leader mixed strategy x. The manuscript must explicitly define this map, prove it is efficiently computable without knowledge of follower parameters, and show that the induced feedback distribution matches the bandit model exactly so that no additive regret terms arise. If the map is only approximate or requires per-round optimization whose cost depends on unknown quantities, the regret transfer fails.
[Abstract] The extension to unknown leader utilities (mentioned in the abstract) must be accompanied by a separate regret analysis; it is unclear whether the same inversion step continues to preserve the linear-bandit guarantees when the leader’s own payoff matrix is also learned.

minor comments (1)

The abstract states the improvement from O(T^{2/3}) to O(T^{1/2}) but does not cite the prior work achieving the former rate; the introduction should include this reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We address each major comment point by point below, providing clarifications from the full paper and committing to revisions that strengthen the exposition without altering the technical claims.

read point-by-point responses

Referee: [Abstract] Abstract (reduction paragraph): The O(sqrt(T)) claim rests on the inversion map from the linear contextual bandit’s recommended utility vector u to a valid leader mixed strategy x. The manuscript must explicitly define this map, prove it is efficiently computable without knowledge of follower parameters, and show that the induced feedback distribution matches the bandit model exactly so that no additive regret terms arise. If the map is only approximate or requires per-round optimization whose cost depends on unknown quantities, the regret transfer fails.

Authors: The full manuscript (Section 3.2) explicitly defines the inversion map: given a recommended utility vector u in R^d, the leader solves the linear program max_x <u, x> s.t. x in Delta(A), where the feasible set is constructed from the observed context without any follower parameters. This LP is solved in polynomial time via standard interior-point methods and is independent of unknown quantities. Theorem 3.3 proves that the distribution of the observed payoff exactly matches the linear contextual bandit feedback model, inducing zero additive regret. We will revise the abstract to include a one-sentence description of this map and add a forward reference to Section 3.2. revision: partial
Referee: [Abstract] The extension to unknown leader utilities (mentioned in the abstract) must be accompanied by a separate regret analysis; it is unclear whether the same inversion step continues to preserve the linear-bandit guarantees when the leader’s own payoff matrix is also learned.

Authors: Section 6 contains a dedicated regret analysis for unknown leader utilities. The leader maintains a separate linear contextual bandit instance to estimate its own payoff matrix; the inversion map is then applied to the estimated utility vector. The analysis decomposes the total regret into the sum of the two bandit regrets plus a controlled estimation error term, preserving the O(sqrt(T)) bound. The inversion itself remains unchanged because it operates on the estimated utilities. We will revise the abstract to explicitly reference this separate analysis in Section 6. revision: yes

Circularity Check

0 steps flagged

No circularity: reduction to external linear contextual bandit subroutines with independent guarantees

full rationale

The abstract describes a reduction from Stackelberg learning to linear contextual bandits where a utility vector is recommended and inverted to a mixed strategy. This step is presented as a methodological mapping whose regret transfer is asserted to hold, without any quoted self-definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. Standard bandit regret bounds are treated as external inputs, and no equations or prior-author uniqueness theorems are invoked to force the O(sqrt(T)) result by construction. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the reduction step implicitly assumes standard properties of mixed strategies and best-response oracles that are standard in the domain.

pith-pipeline@v0.9.0 · 5712 in / 1050 out tokens · 23255 ms · 2026-05-23T03:33:57.241409+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our algorithms rely on a reduction to linear contextual bandits in the utility space: In each round, a linear contextual bandit algorithm recommends a utility vector, which our algorithm inverts to determine the leader’s mixed strategy.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We provide learning algorithms for the leader which achieve O(T^{1/2}) regret under bandit feedback

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 1 internal anchor

[1]

Improved algorithms for linear stochastic bandits

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011

work page 2011
[2]

The sample complexity of stackelberg games.arXiv preprint arXiv:2405.06977, 2024

Francesco Bacchiocchi, Matteo Bollini, Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. The sample complexity of stackelberg games.arXiv preprint arXiv:2405.06977, 2024

work page arXiv 2024
[3]

Commitment without regrets: Online learning in stackelberg security games

Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Commitment without regrets: Online learning in stackelberg security games. InProceedings of the sixteenth ACM conference on economics and computation, pages 61–78, 2015

work page 2015
[4]

Optimal rates and efficient algorithms for online bayesian persuasion

Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Francesco Trovò, and Nicola Gatti. Optimal rates and efficient algorithms for online bayesian persuasion. In International Conference on Machine Learning, pages 2164–2183. PMLR, 2023

work page 2023
[5]

Selling to a no-regret buyer

Mark Braverman, Jieming Mao, Jon Schneider, and Matt Weinberg. Selling to a no-regret buyer. InProceedings of the 2018 ACM Conference on Economics and Computation, pages 523–538, 2018

work page 2018
[6]

Online bayesian persuasion

Matteo Castiglioni, Andrea Celli, Alberto Marchesi, and Nicola Gatti. Online bayesian persuasion. Advances in neural information processing systems, 33:16188–16198, 2020

work page 2020
[7]

Computing the optimal strategy to commit to

Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM conference on Electronic commerce, pages 82–90, 2006

work page 2006
[8]

Learning in auctions: Regret is hard, envy is easy

Constantinos Daskalakis and Vasilis Syrgkanis. Learning in auctions: Regret is hard, envy is easy. In2016 ieee 57th annual symposium on foundations of computer science (focs), pages 219–228. IEEE, 2016. 12

work page 2016
[9]

Strategizing against no-regret learners

Yuan Deng, Jon Schneider, and Balasubramanian Sivan. Strategizing against no-regret learners. Advances in neural information processing systems, 32, 2019

work page 2019
[10]

Impact of decentralized learning on player utilities in stackelberg games.arXiv preprint arXiv:2403.00188, 2024

Kate Donahue, Nicole Immorlica, Meena Jagadeesan, Brendan Lucier, and Aleksandrs Slivkins. Impact of decentralized learning on player utilities in stackelberg games.arXiv preprint arXiv:2403.00188, 2024

work page arXiv 2024
[11]

Strategic classification from revealed preferences

Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategic classification from revealed preferences. InProceedings of the 2018 ACM Conference on Economics and Computation, pages 55–70, 2018

work page 2018
[12]

Performative prediction: Past and future

Moritz Hardt and Celestine Mendler-Dünner. Performative prediction: Past and future. arXiv preprint arXiv:2310.16608, 2023

work page arXiv 2023
[13]

Strategic classification

Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 2016 ACM conference on innovations in theoretical computer science, pages 111–122, 2016

work page 2016
[14]

Regret minimization in stackelberg games with side information.arXiv preprint arXiv:2402.08576, 2024

Keegan Harris, Zhiwei Steven Wu, and Maria-Florina Balcan. Regret minimization in stackelberg games with side information.arXiv preprint arXiv:2402.08576, 2024

work page arXiv 2024
[15]

Bayesian persuasion and information design.Annual Review of Economics, 11(1):249–272, 2019

Emir Kamenica. Bayesian persuasion and information design.Annual Review of Economics, 11(1):249–272, 2019

work page 2019
[16]

Bayesian persuasion.American Economic Review, 101(6):2590–2615, 2011

Emir Kamenica and Matthew Gentzkow. Bayesian persuasion.American Economic Review, 101(6):2590–2615, 2011

work page 2011
[17]

Learning and approximating the optimal strategy to commit to

Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. InAlgorithmic Game Theory: Second International Symposium, SAGT 2009, Paphos, Cyprus, October 18-20, 2009. Proceedings 2, pages 250–262. Springer, 2009

work page 2009
[18]

Catcher-Evader Games

Yuqian Li, Vincent Conitzer, and Dmytro Korzhyk. Catcher-evader games.arXiv preprint arXiv:1602.01896, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[19]

Bypassing the simulator: Near-optimal adversarial linear contextual bandits.Advances in Neural Information Processing Systems, 36, 2024

Haolin Liu, Chen-Yu Wei, and Julian Zimmert. Bypassing the simulator: Near-optimal adversarial linear contextual bandits.Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[20]

Learning optimal strategies to commit to

Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to. InProceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2149–2156, 2019

work page 2019
[21]

Performative prediction

Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pages 7599–7609. PMLR, 2020

work page 2020
[22]

Bistro: An efficient relaxation-based method for contextual bandits

Alexander Rakhlin and Karthik Sridharan. Bistro: An efficient relaxation-based method for contextual bandits. InInternational Conference on Machine Learning, pages 1977–1985. PMLR, 2016

work page 1977
[23]

Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time.Journal of the ACM (JACM), 51(3):385–463, 2004

Daniel A Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time.Journal of the ACM (JACM), 51(3):385–463, 2004

work page 2004
[24]

Efficient algorithms for adversarial contextual learning

Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert Schapire. Efficient algorithms for adversarial contextual learning. InInternational Conference on Machine Learning, pages 2159–2168. PMLR, 2016. 13

work page 2016
[25]

Improved re- gret bounds for oracle-based adversarial contextual bandits.Advances in Neural Information Processing Systems, 29, 2016

Vasilis Syrgkanis, Haipeng Luo, Akshay Krishnamurthy, and Robert E Schapire. Improved re- gret bounds for oracle-based adversarial contextual bandits.Advances in Neural Information Processing Systems, 29, 2016

work page 2016
[26]

Online learning in stackelberg games with an omniscient follower

Geng Zhao, Banghua Zhu, Jiantao Jiao, and Michael Jordan. Online learning in stackelberg games with an omniscient follower. InInternational Conference on Machine Learning, pages 42304–42316. PMLR, 2023. 14 A Appendix for Section 3: A Reduction to Linear Contextual Bandits Theorem 3.1. When R is instantiated as the OFUL algorithm of Abbasi-Yadkori et al.[1...

work page 2023

[1] [1]

Improved algorithms for linear stochastic bandits

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011

work page 2011

[2] [2]

The sample complexity of stackelberg games.arXiv preprint arXiv:2405.06977, 2024

Francesco Bacchiocchi, Matteo Bollini, Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. The sample complexity of stackelberg games.arXiv preprint arXiv:2405.06977, 2024

work page arXiv 2024

[3] [3]

Commitment without regrets: Online learning in stackelberg security games

Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Commitment without regrets: Online learning in stackelberg security games. InProceedings of the sixteenth ACM conference on economics and computation, pages 61–78, 2015

work page 2015

[4] [4]

Optimal rates and efficient algorithms for online bayesian persuasion

Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Francesco Trovò, and Nicola Gatti. Optimal rates and efficient algorithms for online bayesian persuasion. In International Conference on Machine Learning, pages 2164–2183. PMLR, 2023

work page 2023

[5] [5]

Selling to a no-regret buyer

Mark Braverman, Jieming Mao, Jon Schneider, and Matt Weinberg. Selling to a no-regret buyer. InProceedings of the 2018 ACM Conference on Economics and Computation, pages 523–538, 2018

work page 2018

[6] [6]

Online bayesian persuasion

Matteo Castiglioni, Andrea Celli, Alberto Marchesi, and Nicola Gatti. Online bayesian persuasion. Advances in neural information processing systems, 33:16188–16198, 2020

work page 2020

[7] [7]

Computing the optimal strategy to commit to

Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM conference on Electronic commerce, pages 82–90, 2006

work page 2006

[8] [8]

Learning in auctions: Regret is hard, envy is easy

Constantinos Daskalakis and Vasilis Syrgkanis. Learning in auctions: Regret is hard, envy is easy. In2016 ieee 57th annual symposium on foundations of computer science (focs), pages 219–228. IEEE, 2016. 12

work page 2016

[9] [9]

Strategizing against no-regret learners

Yuan Deng, Jon Schneider, and Balasubramanian Sivan. Strategizing against no-regret learners. Advances in neural information processing systems, 32, 2019

work page 2019

[10] [10]

Impact of decentralized learning on player utilities in stackelberg games.arXiv preprint arXiv:2403.00188, 2024

Kate Donahue, Nicole Immorlica, Meena Jagadeesan, Brendan Lucier, and Aleksandrs Slivkins. Impact of decentralized learning on player utilities in stackelberg games.arXiv preprint arXiv:2403.00188, 2024

work page arXiv 2024

[11] [11]

Strategic classification from revealed preferences

Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategic classification from revealed preferences. InProceedings of the 2018 ACM Conference on Economics and Computation, pages 55–70, 2018

work page 2018

[12] [12]

Performative prediction: Past and future

Moritz Hardt and Celestine Mendler-Dünner. Performative prediction: Past and future. arXiv preprint arXiv:2310.16608, 2023

work page arXiv 2023

[13] [13]

Strategic classification

Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 2016 ACM conference on innovations in theoretical computer science, pages 111–122, 2016

work page 2016

[14] [14]

Regret minimization in stackelberg games with side information.arXiv preprint arXiv:2402.08576, 2024

Keegan Harris, Zhiwei Steven Wu, and Maria-Florina Balcan. Regret minimization in stackelberg games with side information.arXiv preprint arXiv:2402.08576, 2024

work page arXiv 2024

[15] [15]

Bayesian persuasion and information design.Annual Review of Economics, 11(1):249–272, 2019

Emir Kamenica. Bayesian persuasion and information design.Annual Review of Economics, 11(1):249–272, 2019

work page 2019

[16] [16]

Bayesian persuasion.American Economic Review, 101(6):2590–2615, 2011

Emir Kamenica and Matthew Gentzkow. Bayesian persuasion.American Economic Review, 101(6):2590–2615, 2011

work page 2011

[17] [17]

Learning and approximating the optimal strategy to commit to

Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. InAlgorithmic Game Theory: Second International Symposium, SAGT 2009, Paphos, Cyprus, October 18-20, 2009. Proceedings 2, pages 250–262. Springer, 2009

work page 2009

[18] [18]

Catcher-Evader Games

Yuqian Li, Vincent Conitzer, and Dmytro Korzhyk. Catcher-evader games.arXiv preprint arXiv:1602.01896, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[19] [19]

Bypassing the simulator: Near-optimal adversarial linear contextual bandits.Advances in Neural Information Processing Systems, 36, 2024

Haolin Liu, Chen-Yu Wei, and Julian Zimmert. Bypassing the simulator: Near-optimal adversarial linear contextual bandits.Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[20] [20]

Learning optimal strategies to commit to

Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to. InProceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2149–2156, 2019

work page 2019

[21] [21]

Performative prediction

Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pages 7599–7609. PMLR, 2020

work page 2020

[22] [22]

Bistro: An efficient relaxation-based method for contextual bandits

Alexander Rakhlin and Karthik Sridharan. Bistro: An efficient relaxation-based method for contextual bandits. InInternational Conference on Machine Learning, pages 1977–1985. PMLR, 2016

work page 1977

[23] [23]

Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time.Journal of the ACM (JACM), 51(3):385–463, 2004

Daniel A Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time.Journal of the ACM (JACM), 51(3):385–463, 2004

work page 2004

[24] [24]

Efficient algorithms for adversarial contextual learning

Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert Schapire. Efficient algorithms for adversarial contextual learning. InInternational Conference on Machine Learning, pages 2159–2168. PMLR, 2016. 13

work page 2016

[25] [25]

Improved re- gret bounds for oracle-based adversarial contextual bandits.Advances in Neural Information Processing Systems, 29, 2016

Vasilis Syrgkanis, Haipeng Luo, Akshay Krishnamurthy, and Robert E Schapire. Improved re- gret bounds for oracle-based adversarial contextual bandits.Advances in Neural Information Processing Systems, 29, 2016

work page 2016

[26] [26]

Online learning in stackelberg games with an omniscient follower

Geng Zhao, Banghua Zhu, Jiantao Jiao, and Michael Jordan. Online learning in stackelberg games with an omniscient follower. InInternational Conference on Machine Learning, pages 42304–42316. PMLR, 2023. 14 A Appendix for Section 3: A Reduction to Linear Contextual Bandits Theorem 3.1. When R is instantiated as the OFUL algorithm of Abbasi-Yadkori et al.[1...

work page 2023