pith. sign in

arxiv: 2502.00204 · v3 · submitted 2025-01-31 · 💻 cs.LG · cs.GT

Nearly-Optimal Bandit Learning in Stackelberg Games with Side Information

Pith reviewed 2026-05-23 03:33 UTC · model grok-4.3

classification 💻 cs.LG cs.GT
keywords Stackelberg gamescontextual banditsonline learningregret minimizationside informationBayesian persuasionauction bidding
0
0 comments X

The pith

Stackelberg leader with side information achieves O(sqrt(T)) regret by reducing to linear contextual bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines repeated Stackelberg games in which the leader receives contextual side information each round, commits to a mixed strategy, and then faces a best-responding follower. Its goal is to bound the leader's regret relative to the best fixed strategy in hindsight under bandit feedback. Prior algorithms attained only O(T^{2/3}) regret, but the new method maps the problem to linear contextual bandit learning over utility vectors. A bandit algorithm selects a utility vector each round; this vector is inverted into a valid leader mixed strategy. The resulting O(T^{1/2}) bound is nearly optimal and carries over to unknown leader utilities as well as to bidding in second-price auctions and online Bayesian persuasion.

Core claim

The paper establishes that Stackelberg learning with contextual side information reduces to linear contextual bandit learning in utility space. Each round a contextual bandit recommends a utility vector that the algorithm inverts into a mixed strategy for the leader, yielding O(T^{1/2}) regret under bandit feedback. The same reduction handles unknown leader utilities and yields algorithms for bidding in second-price auctions and online Bayesian persuasion.

What carries the argument

Reduction to linear contextual bandits in utility space, where each recommended utility vector is inverted to a valid leader mixed strategy.

If this is right

  • The approach extends to the case where the leader's own utility function is unknown.
  • It directly yields algorithms for bidding in second-price auctions with side information.
  • It applies to online Bayesian persuasion with both public and private states.
  • Numerical simulations show the algorithms outperform prior methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Utility-space reductions may simplify learning in other games whose equilibria are defined by best responses.
  • If the inversion map can be computed efficiently, the method could support real-time deployment in mechanism-design settings.
  • The near-optimal rate suggests contextual Stackelberg problems are information-theoretically no harder than standard linear bandits.

Load-bearing premise

Any utility vector recommended by the linear contextual bandit algorithm can be inverted to a valid mixed strategy for the leader without adding extra regret or violating the best-response structure.

What would settle it

An instance in which the inversion step from a recommended utility vector either yields an invalid probability distribution or produces total regret exceeding the linear-bandit bound would falsify the claimed reduction.

Figures

Figures reproduced from arXiv: 2502.00204 by Andrea Celli, Keegan Harris, Maria-Florina Balcan, Martino Bernasconi, Matteo Castiglioni, Zhiwei Steven Wu.

Figure 1
Figure 1. Figure 1: Empirical Results Corollary 4.5. When Ut := {u(zt , µ) : µ ∈ Et} and R is instantiated as the OFUL algorithm of Abbasi-Yadkori et al. [1], the expected regret of Algorithm 1 is E[R(T)] = O(K √ T log(T)) when the sequence of public states is chosen adversarially and the sequence of receiver types is chosen stochastically. Corollary 4.6. When Ut := {u(zt , µ) : µ ∈ Et} and R is instantiated as the regret min… view at source ↗
read the original abstract

We study the problem of online learning in Stackelberg games with side information between a leader and a sequence of followers. In every round the leader observes contextual information and commits to a mixed strategy, after which the follower best-responds. We provide learning algorithms for the leader which achieve $O(T^{1/2})$ regret under bandit feedback, an improvement from the previously best-known rates of $O(T^{2/3})$. Our algorithms rely on a reduction to linear contextual bandits in the utility space: In each round, a linear contextual bandit algorithm recommends a utility vector, which our algorithm inverts to determine the leader's mixed strategy. We extend our algorithms to the setting in which the leader's utility function is unknown, and also apply it to the problems of bidding in second-price auctions with side information and online Bayesian persuasion with public and private states. Finally, we observe that our algorithms empirically outperform previous results on numerical simulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript studies online learning in Stackelberg games with contextual side information. The leader observes context each round, commits to a mixed strategy, and the follower best-responds; the goal is sublinear regret for the leader under bandit feedback. The central contribution is a reduction to linear contextual bandits in utility space: a standard linear bandit algorithm recommends a utility vector that is inverted to produce the leader's mixed strategy, yielding O(T^{1/2}) regret and improving on prior O(T^{2/3}) rates. The approach is extended to unknown leader utilities and applied to second-price auctions with side information and online Bayesian persuasion (public and private states), with supporting numerical simulations.

Significance. If the reduction is shown to transfer regret bounds exactly, the result would be significant: it supplies the first nearly-optimal (sqrt(T)) regret algorithms for contextual Stackelberg learning under bandit feedback and offers a reusable reduction technique for related mechanism-design problems. The empirical comparisons and extensions strengthen the practical contribution.

major comments (2)
  1. [Abstract] Abstract (reduction paragraph): The O(sqrt(T)) claim rests on the inversion map from the linear contextual bandit’s recommended utility vector u to a valid leader mixed strategy x. The manuscript must explicitly define this map, prove it is efficiently computable without knowledge of follower parameters, and show that the induced feedback distribution matches the bandit model exactly so that no additive regret terms arise. If the map is only approximate or requires per-round optimization whose cost depends on unknown quantities, the regret transfer fails.
  2. [Abstract] The extension to unknown leader utilities (mentioned in the abstract) must be accompanied by a separate regret analysis; it is unclear whether the same inversion step continues to preserve the linear-bandit guarantees when the leader’s own payoff matrix is also learned.
minor comments (1)
  1. The abstract states the improvement from O(T^{2/3}) to O(T^{1/2}) but does not cite the prior work achieving the former rate; the introduction should include this reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We address each major comment point by point below, providing clarifications from the full paper and committing to revisions that strengthen the exposition without altering the technical claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract (reduction paragraph): The O(sqrt(T)) claim rests on the inversion map from the linear contextual bandit’s recommended utility vector u to a valid leader mixed strategy x. The manuscript must explicitly define this map, prove it is efficiently computable without knowledge of follower parameters, and show that the induced feedback distribution matches the bandit model exactly so that no additive regret terms arise. If the map is only approximate or requires per-round optimization whose cost depends on unknown quantities, the regret transfer fails.

    Authors: The full manuscript (Section 3.2) explicitly defines the inversion map: given a recommended utility vector u in R^d, the leader solves the linear program max_x <u, x> s.t. x in Delta(A), where the feasible set is constructed from the observed context without any follower parameters. This LP is solved in polynomial time via standard interior-point methods and is independent of unknown quantities. Theorem 3.3 proves that the distribution of the observed payoff exactly matches the linear contextual bandit feedback model, inducing zero additive regret. We will revise the abstract to include a one-sentence description of this map and add a forward reference to Section 3.2. revision: partial

  2. Referee: [Abstract] The extension to unknown leader utilities (mentioned in the abstract) must be accompanied by a separate regret analysis; it is unclear whether the same inversion step continues to preserve the linear-bandit guarantees when the leader’s own payoff matrix is also learned.

    Authors: Section 6 contains a dedicated regret analysis for unknown leader utilities. The leader maintains a separate linear contextual bandit instance to estimate its own payoff matrix; the inversion map is then applied to the estimated utility vector. The analysis decomposes the total regret into the sum of the two bandit regrets plus a controlled estimation error term, preserving the O(sqrt(T)) bound. The inversion itself remains unchanged because it operates on the estimated utilities. We will revise the abstract to explicitly reference this separate analysis in Section 6. revision: yes

Circularity Check

0 steps flagged

No circularity: reduction to external linear contextual bandit subroutines with independent guarantees

full rationale

The abstract describes a reduction from Stackelberg learning to linear contextual bandits where a utility vector is recommended and inverted to a mixed strategy. This step is presented as a methodological mapping whose regret transfer is asserted to hold, without any quoted self-definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. Standard bandit regret bounds are treated as external inputs, and no equations or prior-author uniqueness theorems are invoked to force the O(sqrt(T)) result by construction. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the reduction step implicitly assumes standard properties of mixed strategies and best-response oracles that are standard in the domain.

pith-pipeline@v0.9.0 · 5712 in / 1050 out tokens · 23255 ms · 2026-05-23T03:33:57.241409+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 1 internal anchor

  1. [1]

    Improved algorithms for linear stochastic bandits

    Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011

  2. [2]

    The sample complexity of stackelberg games.arXiv preprint arXiv:2405.06977, 2024

    Francesco Bacchiocchi, Matteo Bollini, Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. The sample complexity of stackelberg games.arXiv preprint arXiv:2405.06977, 2024

  3. [3]

    Commitment without regrets: Online learning in stackelberg security games

    Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Commitment without regrets: Online learning in stackelberg security games. InProceedings of the sixteenth ACM conference on economics and computation, pages 61–78, 2015

  4. [4]

    Optimal rates and efficient algorithms for online bayesian persuasion

    Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Francesco Trovò, and Nicola Gatti. Optimal rates and efficient algorithms for online bayesian persuasion. In International Conference on Machine Learning, pages 2164–2183. PMLR, 2023

  5. [5]

    Selling to a no-regret buyer

    Mark Braverman, Jieming Mao, Jon Schneider, and Matt Weinberg. Selling to a no-regret buyer. InProceedings of the 2018 ACM Conference on Economics and Computation, pages 523–538, 2018

  6. [6]

    Online bayesian persuasion

    Matteo Castiglioni, Andrea Celli, Alberto Marchesi, and Nicola Gatti. Online bayesian persuasion. Advances in neural information processing systems, 33:16188–16198, 2020

  7. [7]

    Computing the optimal strategy to commit to

    Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM conference on Electronic commerce, pages 82–90, 2006

  8. [8]

    Learning in auctions: Regret is hard, envy is easy

    Constantinos Daskalakis and Vasilis Syrgkanis. Learning in auctions: Regret is hard, envy is easy. In2016 ieee 57th annual symposium on foundations of computer science (focs), pages 219–228. IEEE, 2016. 12

  9. [9]

    Strategizing against no-regret learners

    Yuan Deng, Jon Schneider, and Balasubramanian Sivan. Strategizing against no-regret learners. Advances in neural information processing systems, 32, 2019

  10. [10]

    Impact of decentralized learning on player utilities in stackelberg games.arXiv preprint arXiv:2403.00188, 2024

    Kate Donahue, Nicole Immorlica, Meena Jagadeesan, Brendan Lucier, and Aleksandrs Slivkins. Impact of decentralized learning on player utilities in stackelberg games.arXiv preprint arXiv:2403.00188, 2024

  11. [11]

    Strategic classification from revealed preferences

    Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategic classification from revealed preferences. InProceedings of the 2018 ACM Conference on Economics and Computation, pages 55–70, 2018

  12. [12]

    Performative prediction: Past and future

    Moritz Hardt and Celestine Mendler-Dünner. Performative prediction: Past and future. arXiv preprint arXiv:2310.16608, 2023

  13. [13]

    Strategic classification

    Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 2016 ACM conference on innovations in theoretical computer science, pages 111–122, 2016

  14. [14]

    Regret minimization in stackelberg games with side information.arXiv preprint arXiv:2402.08576, 2024

    Keegan Harris, Zhiwei Steven Wu, and Maria-Florina Balcan. Regret minimization in stackelberg games with side information.arXiv preprint arXiv:2402.08576, 2024

  15. [15]

    Bayesian persuasion and information design.Annual Review of Economics, 11(1):249–272, 2019

    Emir Kamenica. Bayesian persuasion and information design.Annual Review of Economics, 11(1):249–272, 2019

  16. [16]

    Bayesian persuasion.American Economic Review, 101(6):2590–2615, 2011

    Emir Kamenica and Matthew Gentzkow. Bayesian persuasion.American Economic Review, 101(6):2590–2615, 2011

  17. [17]

    Learning and approximating the optimal strategy to commit to

    Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. InAlgorithmic Game Theory: Second International Symposium, SAGT 2009, Paphos, Cyprus, October 18-20, 2009. Proceedings 2, pages 250–262. Springer, 2009

  18. [18]

    Catcher-Evader Games

    Yuqian Li, Vincent Conitzer, and Dmytro Korzhyk. Catcher-evader games.arXiv preprint arXiv:1602.01896, 2016

  19. [19]

    Bypassing the simulator: Near-optimal adversarial linear contextual bandits.Advances in Neural Information Processing Systems, 36, 2024

    Haolin Liu, Chen-Yu Wei, and Julian Zimmert. Bypassing the simulator: Near-optimal adversarial linear contextual bandits.Advances in Neural Information Processing Systems, 36, 2024

  20. [20]

    Learning optimal strategies to commit to

    Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to. InProceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2149–2156, 2019

  21. [21]

    Performative prediction

    Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pages 7599–7609. PMLR, 2020

  22. [22]

    Bistro: An efficient relaxation-based method for contextual bandits

    Alexander Rakhlin and Karthik Sridharan. Bistro: An efficient relaxation-based method for contextual bandits. InInternational Conference on Machine Learning, pages 1977–1985. PMLR, 2016

  23. [23]

    Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time.Journal of the ACM (JACM), 51(3):385–463, 2004

    Daniel A Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time.Journal of the ACM (JACM), 51(3):385–463, 2004

  24. [24]

    Efficient algorithms for adversarial contextual learning

    Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert Schapire. Efficient algorithms for adversarial contextual learning. InInternational Conference on Machine Learning, pages 2159–2168. PMLR, 2016. 13

  25. [25]

    Improved re- gret bounds for oracle-based adversarial contextual bandits.Advances in Neural Information Processing Systems, 29, 2016

    Vasilis Syrgkanis, Haipeng Luo, Akshay Krishnamurthy, and Robert E Schapire. Improved re- gret bounds for oracle-based adversarial contextual bandits.Advances in Neural Information Processing Systems, 29, 2016

  26. [26]

    Online learning in stackelberg games with an omniscient follower

    Geng Zhao, Banghua Zhu, Jiantao Jiao, and Michael Jordan. Online learning in stackelberg games with an omniscient follower. InInternational Conference on Machine Learning, pages 42304–42316. PMLR, 2023. 14 A Appendix for Section 3: A Reduction to Linear Contextual Bandits Theorem 3.1. When R is instantiated as the OFUL algorithm of Abbasi-Yadkori et al.[1...