pith. sign in

arxiv: 2605.15331 · v1 · pith:D3AOF3LZnew · submitted 2026-05-14 · 💻 cs.GT

Learning to Persuade a Biased Receiver

Pith reviewed 2026-05-19 15:28 UTC · model grok-4.3

classification 💻 cs.GT
keywords information designpersuasionbiased belief updatingregret minimizationsafe explorationonline learningrepeated interactions
0
0 comments X

The pith

A sender can learn a receiver's fixed bias in belief updating through safe exploration while achieving O(log log T) regret against a full-information oracle.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates repeated persuasion where the receiver mixes the prior and Bayesian posterior according to an unknown fixed bias parameter. The sender selects signaling schemes each round and only observes the receiver's action, seeking to minimize regret compared to knowing the bias. The safe exploration algorithm uses probes that limit the downside of errors by being conservative, allowing estimation of the bias with minimal impact on persuasion success. This results in regret that grows very slowly as O(log log T), which is shown to be optimal by a matching lower bound. Such a result indicates that effective learning is possible in information design settings even when feedback is limited to actions.

Core claim

For general finite state and action spaces and arbitrary bounded utilities, the safe exploration algorithm achieves O(log log T) regret relative to a full-information oracle that knows the receiver's biased updating rule, with a matching Omega(log log T) lower bound. The algorithm exploits the asymmetric cost of probing: conservative probes incur only local loss, whereas overly aggressive probes may lose the persuasive opportunity entirely.

What carries the argument

The safe exploration algorithm that exploits asymmetric probing costs to learn the single unknown bias parameter while preserving high persuasion value.

If this is right

  • The O(log log T) regret bound holds for arbitrary finite state and action spaces with bounded utilities.
  • The algorithm maintains persuasive effectiveness during the learning phase.
  • Analysis of receiver welfare follows from the learned bias and chosen signals.
  • Similar regret rates apply in extensions with jointly unknown priors or contextual time-varying environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Techniques like this could inform the design of adaptive systems in marketing or policy where user biases need to be inferred from responses.
  • The optimality of log log regret suggests that very long horizons are needed to fully amortize the learning cost in persuasion.
  • Extensions to dynamic biases or multiple receivers might require new probing strategies to retain the same rate.

Load-bearing premise

The receiver applies the same bias to update beliefs in every round and the sender perfectly observes each chosen action but gets no other feedback.

What would settle it

A simulation or experiment tracking cumulative regret over increasing T and verifying whether it stays within O(log log T) or violates the rate would confirm or disprove the bound.

Figures

Figures reproduced from arXiv: 2605.15331 by Milind Tambe, Sadie Zhao, Yiling Chen, Yuqi Pan.

Figure 1
Figure 1. Figure 1: Binary instance B = (µ0, qth) = (0.25, 0.60) and true bias α ⋆ = 0.70. inducing action 1. On the receiver side, action 0 is set as optimal at the prior without loss of generality, otherwise the sender can induce action 1 without learning and regret is zero. Since the receiver’s utility difference between actions 1 and 0 is affine in the distorted belief, the action follows a cutoff rule. Let qth ∈ (µ0, 1) … view at source ↗
Figure 2
Figure 2. Figure 2: Action regions for µ0 = (0.50, 0.27, 0.23). In state order (e0, e1, e2), uR(a0, ·) = (0, 0, 0), uR(a1, ·) = (−2.1, 0.3, 0.9), and uR(a2, ·) = (−2.1, −0.3, 1.5). We now specify the condition for tie-breaking rules. Define the set of relevant actions: Arel = n a ∈ A : ∃ν ∈ ∆(Ω) that a uniquely maximizes X ω∈Ω (α ⋆ ν(ω)+(1−α ⋆ )µ0(ω))uR(a, ω) o . Actions outside Arel are weakly dominated for the receiver: the… view at source ↗
read the original abstract

We study a repeated information design setting in which the receiver, who is also the decision-maker, updates beliefs in a systematically biased way. More specifically, a distorted posterior in our model can be written as a convex combination of the prior and the Bayesian posterior, governed by a fixed but unknown parameter. Over repeated interactions, the sender chooses persuasive signaling schemes, observes only the receiver's realized actions, and seeks to minimize regret relative to a full-information oracle that knows the receiver's biased updating rule. We propose a safe exploration algorithm for learning the receiver's bias while maintaining high persuasion value. The algorithm exploits the asymmetric cost of probing: conservative probes incur only local loss, whereas overly aggressive probes may lose the persuasive opportunity entirely. For general finite state and action spaces and arbitrary bounded utilities, our method achieves $O(\log\log T)$ regret. A matching $\Omega(\log\log T)$ lower bound shows that this rate is optimal. We further discuss the influence on receiver welfare, as well as extensions to jointly unknown prior and bias, and contextual settings with time-varying priors and utilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies repeated persuasion where a sender designs signals for a receiver who updates beliefs as a convex combination of the prior and Bayesian posterior, governed by a fixed unknown bias parameter. The sender observes only realized actions, chooses persuasive schemes each round, and minimizes regret relative to a full-information oracle knowing the bias. For general finite state/action spaces and arbitrary bounded utilities, a safe exploration algorithm is proposed that exploits asymmetric probing costs to achieve O(log log T) regret, with a matching Omega(log log T) lower bound. Extensions to unknown priors, contextual settings, and receiver welfare are discussed.

Significance. If the analysis holds, the result is significant for online learning in information design: it establishes an optimal double-logarithmic regret rate under model misspecification on the receiver side, enabled by conservative probes that incur only local loss. The matching lower bound and the general finite-space setting with arbitrary utilities are strengths. The paper also provides concrete algorithmic design and discusses welfare implications, which could inform future work on strategic communication and mechanism design with learning.

major comments (2)
  1. [regret analysis of safe exploration algorithm] Regret analysis for the safe exploration procedure: the O(log log T) upper bound requires that conservative probes refine the bias estimate at double-exponential speed while remaining persuasive after each update. For arbitrary bounded utilities the receiver's best response can be discontinuous in the perceived posterior; it is unclear whether the probe construction guarantees that an updated estimate keeps the chosen signal persuasive and informative, which is load-bearing for obtaining the claimed rate rather than a slower polynomial rate.
  2. [lower bound section] Lower bound construction: the Omega(log log T) lower bound relies on the necessity of safe exploration and that discontinuities in best responses do not permit faster learning. A verification that the construction carries through for general utilities (where discontinuities might alter the information-gathering speed) would be needed to confirm optimality.
minor comments (2)
  1. [abstract] Abstract: the precise mathematical definition of the biased posterior (convex combination parameter) should be stated explicitly rather than described only in words.
  2. [algorithm description] The manuscript would benefit from a short illustrative example with discontinuous best response to demonstrate that the safe probes remain valid after estimate updates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our paper. We address each major comment below, providing clarifications on the regret analysis and lower bound. We believe these responses resolve the concerns while maintaining the integrity of our results for general settings.

read point-by-point responses
  1. Referee: Regret analysis for the safe exploration procedure: the O(log log T) upper bound requires that conservative probes refine the bias estimate at double-exponential speed while remaining persuasive after each update. For arbitrary bounded utilities the receiver's best response can be discontinuous in the perceived posterior; it is unclear whether the probe construction guarantees that an updated estimate keeps the chosen signal persuasive and informative, which is load-bearing for obtaining the claimed rate rather than a slower polynomial rate.

    Authors: We appreciate this observation. In Section 4 of the manuscript, the safe exploration algorithm constructs probes that are persuasive for all bias parameters in the current uncertainty interval. This is achieved by solving for signals that induce the desired action for the worst-case bias in the interval, ensuring that updates to the estimate (which refine the interval) keep the signal persuasive because the new interval is a subset. Although best responses may be discontinuous, the persuasion condition is maintained within the regions where the action is fixed, and our algorithm avoids crossing discontinuity points by using conservative probes. The double-exponential convergence is due to the multiplicative reduction in uncertainty per probe. To make this clearer, we will add a remark in the main text and expand the proof in the appendix to explicitly handle the discontinuity cases for arbitrary utilities. revision: partial

  2. Referee: Lower bound construction: the Omega(log log T) lower bound relies on the necessity of safe exploration and that discontinuities in best responses do not permit faster learning. A verification that the construction carries through for general utilities (where discontinuities might alter the information-gathering speed) would be needed to confirm optimality.

    Authors: For the lower bound in Section 5, the construction uses a specific family of utility functions where the best response discontinuities are present but do not allow bypassing the safe exploration requirement. The information-theoretic argument shows that to achieve low regret, the algorithm must perform probes that safely narrow the bias parameter, and any attempt to use aggressive probes risks high regret due to potential loss of persuasion. This holds for general utilities because the lower bound instance is embedded in the general setting, and discontinuities only increase the difficulty of learning without safe probes. We can provide additional details in the revision to verify the construction explicitly for discontinuous cases. revision: partial

Circularity Check

0 steps flagged

No significant circularity; regret bound derived independently of fitted bias parameter

full rationale

The paper presents a safe exploration algorithm that learns a single fixed scalar bias parameter through conservative probes while maintaining persuasion. The O(log log T) upper bound and matching lower bound are obtained via standard analysis of asymmetric-cost exploration against an external full-information oracle benchmark. No step reduces the claimed regret expression to a quantity defined by the same bias parameter, nor does the derivation rely on self-citation chains, imported uniqueness theorems, or ansatz smuggling. The central result remains self-contained with independent content for general finite spaces and arbitrary utilities.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the modeling assumption that the bias is a single fixed scalar governing a convex combination of prior and posterior, plus standard assumptions on finite spaces and bounded utilities. No free parameters are fitted inside the algorithm itself; the bias is the unknown to be learned.

axioms (2)
  • domain assumption Receiver's distorted posterior is a convex combination of prior and Bayesian posterior with fixed unknown weight
    Stated in the abstract as the model of biased updating
  • domain assumption Sender observes only the realized action after each signal
    Explicit in the problem setup

pith-pipeline@v0.9.0 · 5718 in / 1345 out tokens · 28368 ms · 2026-05-19T15:28:38.900846+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

  1. [1]

    MIT press, 1995

    Robert J Aumann, Michael Maschler, and Richard E Stearns.Repeated games with incomplete information. MIT press, 1995

  2. [2]

    Over-and underreaction to information: Belief updating with cognitive constraints

    Cuimin Ba, J Aislinn Bohren, and Alex Imas. Over-and underreaction to information: Belief updating with cognitive constraints. Technical report, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania, 2025

  3. [3]

    Markov persuasion processes: Learning to persuade from scratch.arXiv preprint arXiv:2402.03077, 2024

    Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. Markov persuasion processes: Learning to persuade from scratch.arXiv preprint arXiv:2402.03077, 2024

  4. [4]

    A meta-analysis of the weight of advice in decision-making.Current Psychology, 42(28): 24516–24541, 2023

    Phoebe E Bailey, Tarren Leon, Natalie C Ebner, Ahmed A Moustafa, and Gabrielle Weidemann. A meta-analysis of the weight of advice in decision-making.Current Psychology, 42(28): 24516–24541, 2023

  5. [5]

    The base-rate fallacy in probability judgments.Acta Psychologica, 44(3): 211–233, 1980

    Maya Bar-Hillel. The base-rate fallacy in probability judgments.Acta Psychologica, 44(3): 211–233, 1980

  6. [6]

    Errors in probabilistic reasoning and judgment biases.Handbook of Behavioral Economics: Applications and Foundations 1, 2:69–186, 2019

    Daniel J Benjamin. Errors in probabilistic reasoning and judgment biases.Handbook of Behavioral Economics: Applications and Foundations 1, 2:69–186, 2019

  7. [7]

    Information design, bayesian persuasion, and bayes correlated equilibrium.American Economic Review, 106(5):586–591, 2016

    Dirk Bergemann and Stephen Morris. Information design, bayesian persuasion, and bayes correlated equilibrium.American Economic Review, 106(5):586–591, 2016

  8. [8]

    Equivalent comparisons of experiments.The annals of mathematical statistics, pages 265–272, 1953

    David Blackwell. Equivalent comparisons of experiments.The annals of mathematical statistics, pages 265–272, 1953

  9. [9]

    Online bayesian persua- sion.Advances in neural information processing systems, 33:16188–16198, 2020

    Matteo Castiglioni, Andrea Celli, Alberto Marchesi, and Nicola Gatti. Online bayesian persua- sion.Advances in neural information processing systems, 33:16188–16198, 2020

  10. [10]

    Multi-receiver online bayesian persuasion

    Matteo Castiglioni, Alberto Marchesi, Andrea Celli, and Nicola Gatti. Multi-receiver online bayesian persuasion. InInternational Conference on Machine Learning, pages 1314–1323. PMLR, 2021

  11. [11]

    Bias detection via signaling.Advances in Neural Information Processing Systems, 37:69120–69143, 2024

    Yiling Chen, Tao Lin, Ariel D Procaccia, Aaditya Ramdas, and Itai Shapira. Bias detection via signaling.Advances in Neural Information Processing Systems, 37:69120–69143, 2024

  12. [12]

    Learning to signal in the goldilocks zone: Improving adversary compliance in security games

    Sarah Cooney, Kai Wang, Elizabeth Bondi, Thanh Nguyen, Phebe Vayanos, Hailey Winetrobe, Edward A Cranford, Cleotilde Gonzalez, Christian Lebiere, and Milind Tambe. Learning to signal in the goldilocks zone: Improving adversary compliance in security games. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 725–740....

  13. [13]

    Toward personalized deceptive signaling for cyber defense using cognitive models.Topics in Cognitive Science, 12(3):992–1011, 2020

    Edward A Cranford, Cleotilde Gonzalez, Palvi Aggarwal, Sarah Cooney, Milind Tambe, and Christian Lebiere. Toward personalized deceptive signaling for cyber defense using cognitive models.Topics in Cognitive Science, 12(3):992–1011, 2020

  14. [14]

    A theory of learning to infer.Psychological review, 127(3):412, 2020

    Ishita Dasgupta, Eric Schulz, Joshua B Tenenbaum, and Samuel J Gershman. A theory of learning to infer.Psychological review, 127(3):412, 2020

  15. [15]

    Non-bayesian persuasion.Journal of Political Economy, 130(10):2594–2642, 2022

    Geoffroy De Clippel and Xu Zhang. Non-bayesian persuasion.Journal of Political Economy, 130(10):2594–2642, 2022

  16. [16]

    Algorithm aversion: people erroneously avoid algorithms after seeing them err.Journal of experimental psychology: General, 144(1):114, 2015

    Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Algorithm aversion: people erroneously avoid algorithms after seeing them err.Journal of experimental psychology: General, 144(1):114, 2015

  17. [17]

    Algorithmic bayesian persuasion

    Shaddin Dughmi and Haifeng Xu. Algorithmic bayesian persuasion. InProceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 412–425, 2016

  18. [18]

    Algorithmic persuasion with no externalities

    Shaddin Dughmi and Haifeng Xu. Algorithmic persuasion with no externalities. InProceedings of the 2017 ACM Conference on Economics and Computation, pages 351–368, 2017. 11

  19. [19]

    Conservatism in human information processing.Formal representation of human judgment, 1968

    Ward Edwards. Conservatism in human information processing.Formal representation of human judgment, 1968

  20. [20]

    Non-bayesian learning.The BE Journal of Theoretical Economics, 10(1):1–20, 2010

    Larry G Epstein, Jawwad Noor, Alvaro Sandroni, et al. Non-bayesian learning.The BE Journal of Theoretical Economics, 10(1):1–20, 2010

  21. [21]

    Online bayesian recommendation with no regret

    Yiding Feng, Wei Tang, and Haifeng Xu. Online bayesian recommendation with no regret. In Proceedings of the 23rd ACM Conference on Economics and Computation, pages 818–819, 2022

  22. [22]

    Rationality-robust information design: Bayesian persuasion under quantal response

    Yiding Feng, Chien-Ju Ho, and Wei Tang. Rationality-robust information design: Bayesian persuasion under quantal response. InProceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 501–546. SIAM, 2024

  23. [23]

    A rothschild-stiglitz approach to bayesian persuasion

    Matthew Gentzkow and Emir Kamenica. A rothschild-stiglitz approach to bayesian persuasion. American Economic Review, 106(5):597–601, 2016

  24. [24]

    Bayes rule as a descriptive model: The representativeness heuristic.The Quarterly journal of economics, 95(3):537–557, 1980

    David M Grether. Bayes rule as a descriptive model: The representativeness heuristic.The Quarterly journal of economics, 95(3):537–557, 1980

  25. [25]

    Persuasion with motivated beliefs

    David Hagmann and George Loewenstein. Persuasion with motivated beliefs. InOpinion Dynamics & Collective Decisions Workshop, 2017

  26. [26]

    Algorithmic persuasion through simulation.arXiv preprint arXiv:2311.18138, 2023

    Keegan Harris, Nicole Immorlica, Brendan Lucier, and Aleksandrs Slivkins. Algorithmic persuasion through simulation.arXiv preprint arXiv:2311.18138, 2023

  27. [27]

    Bayesian persuasion.American Economic Review, 101(6):2590–2615, 2011

    Emir Kamenica and Matthew Gentzkow. Bayesian persuasion.American Economic Review, 101(6):2590–2615, 2011

  28. [28]

    The value of knowing a demand curve: Bounds on regret for online posted-price auctions

    Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pages 594–605. IEEE, 2003

  29. [29]

    Dynamic Non-Bayesian Persuasion

    Masanori Kobayashi. Dynamic non-bayesian persuasion.arXiv preprint arXiv:2508.12328, 2025

  30. [30]

    Information design with unknown prior.arXiv preprint arXiv:2410.05533, 2024

    Ce Li and Tao Lin. Information design with unknown prior.arXiv preprint arXiv:2410.05533, 2024

  31. [31]

    What influences algorithmic decision-making? a systematic literature review on algorithm aversion

    Hasan Mahmud, AKM Najmul Islam, Syed Ishtiaque Ahmed, and Kari Smolander. What influences algorithmic decision-making? a systematic literature review on algorithm aversion. Technological forecasting and social change, 175:121390, 2022

  32. [32]

    Optimal information disclosure.Journal of political Economy, 118 (5):949–987, 2010

    Luis Rayo and Ilya Segal. Optimal information disclosure.Journal of political Economy, 118 (5):949–987, 2010

  33. [33]

    Motivated skepticism in the evaluation of political beliefs

    Charles S Taber and Milton Lodge. Motivated skepticism in the evaluation of political beliefs. American journal of political science, 50(3):755–769, 2006

  34. [34]

    On the bayesian rational assumption in information design

    Wei Tang and Chien-Ju Ho. On the bayesian rational assumption in information design. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 9, pages 120–130, 2021

  35. [35]

    Bayesian or biased? analytic thinking and political belief updating.Cognition, 204:104375, 2020

    Ben M Tappin, Gordon Pennycook, and David G Rand. Bayesian or biased? analytic thinking and political belief updating.Cognition, 204:104375, 2020

  36. [36]

    Sequential information design: Markov persuasion process and its efficient reinforcement learning.arXiv preprint arXiv:2202.10678, 2022

    Jibang Wu, Zixuan Zhang, Zhe Feng, Zhaoran Wang, Zhuoran Yang, Michael I Jordan, and Haifeng Xu. Sequential information design: Markov persuasion process and its efficient reinforcement learning.arXiv preprint arXiv:2202.10678, 2022

  37. [37]

    Probabilistic computations: Toward a unified measure of complexity

    Andrew Chi-Chin Yao. Probabilistic computations: Toward a unified measure of complexity. In18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 222–227. IEEE Computer Society, 1977

  38. [38]

    SkX t=1 Ht # =E

    You Zu, Krishnamurthy Iyer, and Haifeng Xu. Learning to persuade on the fly: Robustness against ignorance. InProceedings of the 22nd ACM Conference on Economics and Computation, pages 927–928, 2021. 12 A Related Work (Detailed Discussion) A large body of empirical and theoretical work shows that Bayesian updating is a useful normative benchmark but often ...

  39. [39]

    Each physical round contributes at most ∆Umax regret

    Therefore E[Xk]≤ √ 2 δµ0 ,E " NX k=1 Xk # ≤ √ 2N δµ0 =O(log logT). Each physical round contributes at most ∆Umax regret. The hidden constant here is √ 2∆Umax/δµ0. C.3 Relevant Actions Recall the relevant-action set Arel = n a∈A:∃ν∈∆(Ω)thatauniquely maximizes X ω∈Ω (α⋆ν(ω)+(1−α ⋆)µ0(ω))uR(a, ω) o . The following discussion shows that without excluding thos...

  40. [40]

    28 Also, bλ+ ≤ 2∥r∥2 δµ0 ≤ 2pϵ δµ0

    =µ 0. 28 Also, bλ+ ≤ 2∥r∥2 δµ0 ≤ 2pϵ δµ0 . For utility, write va(ν) := X ω∈Ω ν(ω)u S(a, ω). Then |va(ν)−v a(ν′)| ≤U max p |Ω| ∥ν−ν ′∥2. Therefore, |US(τ)−U S(bτ)| ≤Umax p |Ω| X ν′ i̸=νi λi∥νi −ν ′ i∥2 +U max X i |λi −bλi|+U maxbλ+ ≤U max p |Ω|pϵ+ 2U maxbλ+ ≤U max p |Ω|pϵ+ 4Umax δµ0 pϵ. This proves the claim. C.4.3 Proof of Proposition 3 Since τopt ={(λ i,...

  41. [41]

    The true bias is detectable in every context:α ⋆ ≥¯αmin := maxx∈X αmin(Ix)

  42. [42]

    The sender utilities are uniformly bounded:U X max := maxx∈X maxa,ω |uS,x(a, ω)|<∞

  43. [43]

    For every contextx, the analogue of Proposition 5 holds, and there exists a positive constant c= min x cx >0, ϵ= min x ϵx >0 for the first branch in Proposition 5, and a positive constantδ= min x δx >0for the second branch

  44. [44]

    Equivalently, there exist positive constant δµ0 = min x∈X δµ0,x and finite constantsκ, G max, Lb satisfying κ= max x∈X κx, G max = max x∈X Gmax,x, L b = max x∈X Lb,x

    The constants in Theorem 2 can be chosen uniformly over x∈ X , see more details in Section C.1. Equivalently, there exist positive constant δµ0 = min x∈X δµ0,x and finite constantsκ, G max, Lb satisfying κ= max x∈X κx, G max = max x∈X Gmax,x, L b = max x∈X Lb,x. Note that the above assumption naturally holds if the context setXis finite. Contextual algori...

  45. [45]

    2:Initialize:interval[a, b]←[0,1], step sizeε← 1 2, timet←1

    the receiver takes action 1 if and only if the distorted posterior is at leastbq, where 0< µ ∗ < bq <1; 38 Algorithm 9Safe Exploration for Unknown Bias and Unknown Prior 1:Input:horizonT, thresholdbq. 2:Initialize:interval[a, b]←[0,1], step sizeε← 1 2, timet←1. 3:Stage 1: Safe exploration 4:whileb−a > T −1 andt≤Tdo 5:m←a,m prev ←a 6:whilem≤bandt≤Tdo 7:Com...

  46. [46]

    persuasion is feasible, namely α∗ ≥α min = bq−µ∗ 1−µ ∗ so thatm ∗ ∈[0,1]is well-defined

  47. [47]

    LetΠ SEJ denote Algorithm 9

    ties are broken in favor of action1. LetΠ SEJ denote Algorithm 9. Then, for everyT≥4, RegT ΠSEJ;α ∗, µ∗ =O(log logT). In particular, RegT ΠSEJ;α ∗, µ∗ ≤ 1−µ ∗ µ∗ + 1 (2 +⌈log 2 log2 T⌉) + 1. Proof. Let V ∗ denote the full-information benchmark value. As noted above, when (α⋆, µ∗) is known, an optimal scheme is supported on{0, ν ∗}, soV ∗ = µ∗ ν∗ =µ ∗ + (1...