Learning to Persuade a Biased Receiver
Pith reviewed 2026-05-19 15:28 UTC · model grok-4.3
The pith
A sender can learn a receiver's fixed bias in belief updating through safe exploration while achieving O(log log T) regret against a full-information oracle.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For general finite state and action spaces and arbitrary bounded utilities, the safe exploration algorithm achieves O(log log T) regret relative to a full-information oracle that knows the receiver's biased updating rule, with a matching Omega(log log T) lower bound. The algorithm exploits the asymmetric cost of probing: conservative probes incur only local loss, whereas overly aggressive probes may lose the persuasive opportunity entirely.
What carries the argument
The safe exploration algorithm that exploits asymmetric probing costs to learn the single unknown bias parameter while preserving high persuasion value.
If this is right
- The O(log log T) regret bound holds for arbitrary finite state and action spaces with bounded utilities.
- The algorithm maintains persuasive effectiveness during the learning phase.
- Analysis of receiver welfare follows from the learned bias and chosen signals.
- Similar regret rates apply in extensions with jointly unknown priors or contextual time-varying environments.
Where Pith is reading between the lines
- Techniques like this could inform the design of adaptive systems in marketing or policy where user biases need to be inferred from responses.
- The optimality of log log regret suggests that very long horizons are needed to fully amortize the learning cost in persuasion.
- Extensions to dynamic biases or multiple receivers might require new probing strategies to retain the same rate.
Load-bearing premise
The receiver applies the same bias to update beliefs in every round and the sender perfectly observes each chosen action but gets no other feedback.
What would settle it
A simulation or experiment tracking cumulative regret over increasing T and verifying whether it stays within O(log log T) or violates the rate would confirm or disprove the bound.
Figures
read the original abstract
We study a repeated information design setting in which the receiver, who is also the decision-maker, updates beliefs in a systematically biased way. More specifically, a distorted posterior in our model can be written as a convex combination of the prior and the Bayesian posterior, governed by a fixed but unknown parameter. Over repeated interactions, the sender chooses persuasive signaling schemes, observes only the receiver's realized actions, and seeks to minimize regret relative to a full-information oracle that knows the receiver's biased updating rule. We propose a safe exploration algorithm for learning the receiver's bias while maintaining high persuasion value. The algorithm exploits the asymmetric cost of probing: conservative probes incur only local loss, whereas overly aggressive probes may lose the persuasive opportunity entirely. For general finite state and action spaces and arbitrary bounded utilities, our method achieves $O(\log\log T)$ regret. A matching $\Omega(\log\log T)$ lower bound shows that this rate is optimal. We further discuss the influence on receiver welfare, as well as extensions to jointly unknown prior and bias, and contextual settings with time-varying priors and utilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies repeated persuasion where a sender designs signals for a receiver who updates beliefs as a convex combination of the prior and Bayesian posterior, governed by a fixed unknown bias parameter. The sender observes only realized actions, chooses persuasive schemes each round, and minimizes regret relative to a full-information oracle knowing the bias. For general finite state/action spaces and arbitrary bounded utilities, a safe exploration algorithm is proposed that exploits asymmetric probing costs to achieve O(log log T) regret, with a matching Omega(log log T) lower bound. Extensions to unknown priors, contextual settings, and receiver welfare are discussed.
Significance. If the analysis holds, the result is significant for online learning in information design: it establishes an optimal double-logarithmic regret rate under model misspecification on the receiver side, enabled by conservative probes that incur only local loss. The matching lower bound and the general finite-space setting with arbitrary utilities are strengths. The paper also provides concrete algorithmic design and discusses welfare implications, which could inform future work on strategic communication and mechanism design with learning.
major comments (2)
- [regret analysis of safe exploration algorithm] Regret analysis for the safe exploration procedure: the O(log log T) upper bound requires that conservative probes refine the bias estimate at double-exponential speed while remaining persuasive after each update. For arbitrary bounded utilities the receiver's best response can be discontinuous in the perceived posterior; it is unclear whether the probe construction guarantees that an updated estimate keeps the chosen signal persuasive and informative, which is load-bearing for obtaining the claimed rate rather than a slower polynomial rate.
- [lower bound section] Lower bound construction: the Omega(log log T) lower bound relies on the necessity of safe exploration and that discontinuities in best responses do not permit faster learning. A verification that the construction carries through for general utilities (where discontinuities might alter the information-gathering speed) would be needed to confirm optimality.
minor comments (2)
- [abstract] Abstract: the precise mathematical definition of the biased posterior (convex combination parameter) should be stated explicitly rather than described only in words.
- [algorithm description] The manuscript would benefit from a short illustrative example with discontinuous best response to demonstrate that the safe probes remain valid after estimate updates.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable comments on our paper. We address each major comment below, providing clarifications on the regret analysis and lower bound. We believe these responses resolve the concerns while maintaining the integrity of our results for general settings.
read point-by-point responses
-
Referee: Regret analysis for the safe exploration procedure: the O(log log T) upper bound requires that conservative probes refine the bias estimate at double-exponential speed while remaining persuasive after each update. For arbitrary bounded utilities the receiver's best response can be discontinuous in the perceived posterior; it is unclear whether the probe construction guarantees that an updated estimate keeps the chosen signal persuasive and informative, which is load-bearing for obtaining the claimed rate rather than a slower polynomial rate.
Authors: We appreciate this observation. In Section 4 of the manuscript, the safe exploration algorithm constructs probes that are persuasive for all bias parameters in the current uncertainty interval. This is achieved by solving for signals that induce the desired action for the worst-case bias in the interval, ensuring that updates to the estimate (which refine the interval) keep the signal persuasive because the new interval is a subset. Although best responses may be discontinuous, the persuasion condition is maintained within the regions where the action is fixed, and our algorithm avoids crossing discontinuity points by using conservative probes. The double-exponential convergence is due to the multiplicative reduction in uncertainty per probe. To make this clearer, we will add a remark in the main text and expand the proof in the appendix to explicitly handle the discontinuity cases for arbitrary utilities. revision: partial
-
Referee: Lower bound construction: the Omega(log log T) lower bound relies on the necessity of safe exploration and that discontinuities in best responses do not permit faster learning. A verification that the construction carries through for general utilities (where discontinuities might alter the information-gathering speed) would be needed to confirm optimality.
Authors: For the lower bound in Section 5, the construction uses a specific family of utility functions where the best response discontinuities are present but do not allow bypassing the safe exploration requirement. The information-theoretic argument shows that to achieve low regret, the algorithm must perform probes that safely narrow the bias parameter, and any attempt to use aggressive probes risks high regret due to potential loss of persuasion. This holds for general utilities because the lower bound instance is embedded in the general setting, and discontinuities only increase the difficulty of learning without safe probes. We can provide additional details in the revision to verify the construction explicitly for discontinuous cases. revision: partial
Circularity Check
No significant circularity; regret bound derived independently of fitted bias parameter
full rationale
The paper presents a safe exploration algorithm that learns a single fixed scalar bias parameter through conservative probes while maintaining persuasion. The O(log log T) upper bound and matching lower bound are obtained via standard analysis of asymmetric-cost exploration against an external full-information oracle benchmark. No step reduces the claimed regret expression to a quantity defined by the same bias parameter, nor does the derivation rely on self-citation chains, imported uniqueness theorems, or ansatz smuggling. The central result remains self-contained with independent content for general finite spaces and arbitrary utilities.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Receiver's distorted posterior is a convex combination of prior and Bayesian posterior with fixed unknown weight
- domain assumption Sender observes only the realized action after each signal
Reference graph
Works this paper leans on
-
[1]
Robert J Aumann, Michael Maschler, and Richard E Stearns.Repeated games with incomplete information. MIT press, 1995
work page 1995
-
[2]
Over-and underreaction to information: Belief updating with cognitive constraints
Cuimin Ba, J Aislinn Bohren, and Alex Imas. Over-and underreaction to information: Belief updating with cognitive constraints. Technical report, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania, 2025
work page 2025
-
[3]
Markov persuasion processes: Learning to persuade from scratch.arXiv preprint arXiv:2402.03077, 2024
Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. Markov persuasion processes: Learning to persuade from scratch.arXiv preprint arXiv:2402.03077, 2024
-
[4]
Phoebe E Bailey, Tarren Leon, Natalie C Ebner, Ahmed A Moustafa, and Gabrielle Weidemann. A meta-analysis of the weight of advice in decision-making.Current Psychology, 42(28): 24516–24541, 2023
work page 2023
-
[5]
The base-rate fallacy in probability judgments.Acta Psychologica, 44(3): 211–233, 1980
Maya Bar-Hillel. The base-rate fallacy in probability judgments.Acta Psychologica, 44(3): 211–233, 1980
work page 1980
-
[6]
Daniel J Benjamin. Errors in probabilistic reasoning and judgment biases.Handbook of Behavioral Economics: Applications and Foundations 1, 2:69–186, 2019
work page 2019
-
[7]
Dirk Bergemann and Stephen Morris. Information design, bayesian persuasion, and bayes correlated equilibrium.American Economic Review, 106(5):586–591, 2016
work page 2016
-
[8]
Equivalent comparisons of experiments.The annals of mathematical statistics, pages 265–272, 1953
David Blackwell. Equivalent comparisons of experiments.The annals of mathematical statistics, pages 265–272, 1953
work page 1953
-
[9]
Online bayesian persua- sion.Advances in neural information processing systems, 33:16188–16198, 2020
Matteo Castiglioni, Andrea Celli, Alberto Marchesi, and Nicola Gatti. Online bayesian persua- sion.Advances in neural information processing systems, 33:16188–16198, 2020
work page 2020
-
[10]
Multi-receiver online bayesian persuasion
Matteo Castiglioni, Alberto Marchesi, Andrea Celli, and Nicola Gatti. Multi-receiver online bayesian persuasion. InInternational Conference on Machine Learning, pages 1314–1323. PMLR, 2021
work page 2021
-
[11]
Bias detection via signaling.Advances in Neural Information Processing Systems, 37:69120–69143, 2024
Yiling Chen, Tao Lin, Ariel D Procaccia, Aaditya Ramdas, and Itai Shapira. Bias detection via signaling.Advances in Neural Information Processing Systems, 37:69120–69143, 2024
work page 2024
-
[12]
Learning to signal in the goldilocks zone: Improving adversary compliance in security games
Sarah Cooney, Kai Wang, Elizabeth Bondi, Thanh Nguyen, Phebe Vayanos, Hailey Winetrobe, Edward A Cranford, Cleotilde Gonzalez, Christian Lebiere, and Milind Tambe. Learning to signal in the goldilocks zone: Improving adversary compliance in security games. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 725–740....
work page 2019
-
[13]
Edward A Cranford, Cleotilde Gonzalez, Palvi Aggarwal, Sarah Cooney, Milind Tambe, and Christian Lebiere. Toward personalized deceptive signaling for cyber defense using cognitive models.Topics in Cognitive Science, 12(3):992–1011, 2020
work page 2020
-
[14]
A theory of learning to infer.Psychological review, 127(3):412, 2020
Ishita Dasgupta, Eric Schulz, Joshua B Tenenbaum, and Samuel J Gershman. A theory of learning to infer.Psychological review, 127(3):412, 2020
work page 2020
-
[15]
Non-bayesian persuasion.Journal of Political Economy, 130(10):2594–2642, 2022
Geoffroy De Clippel and Xu Zhang. Non-bayesian persuasion.Journal of Political Economy, 130(10):2594–2642, 2022
work page 2022
-
[16]
Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Algorithm aversion: people erroneously avoid algorithms after seeing them err.Journal of experimental psychology: General, 144(1):114, 2015
work page 2015
-
[17]
Algorithmic bayesian persuasion
Shaddin Dughmi and Haifeng Xu. Algorithmic bayesian persuasion. InProceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 412–425, 2016
work page 2016
-
[18]
Algorithmic persuasion with no externalities
Shaddin Dughmi and Haifeng Xu. Algorithmic persuasion with no externalities. InProceedings of the 2017 ACM Conference on Economics and Computation, pages 351–368, 2017. 11
work page 2017
-
[19]
Conservatism in human information processing.Formal representation of human judgment, 1968
Ward Edwards. Conservatism in human information processing.Formal representation of human judgment, 1968
work page 1968
-
[20]
Non-bayesian learning.The BE Journal of Theoretical Economics, 10(1):1–20, 2010
Larry G Epstein, Jawwad Noor, Alvaro Sandroni, et al. Non-bayesian learning.The BE Journal of Theoretical Economics, 10(1):1–20, 2010
work page 2010
-
[21]
Online bayesian recommendation with no regret
Yiding Feng, Wei Tang, and Haifeng Xu. Online bayesian recommendation with no regret. In Proceedings of the 23rd ACM Conference on Economics and Computation, pages 818–819, 2022
work page 2022
-
[22]
Rationality-robust information design: Bayesian persuasion under quantal response
Yiding Feng, Chien-Ju Ho, and Wei Tang. Rationality-robust information design: Bayesian persuasion under quantal response. InProceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 501–546. SIAM, 2024
work page 2024
-
[23]
A rothschild-stiglitz approach to bayesian persuasion
Matthew Gentzkow and Emir Kamenica. A rothschild-stiglitz approach to bayesian persuasion. American Economic Review, 106(5):597–601, 2016
work page 2016
-
[24]
David M Grether. Bayes rule as a descriptive model: The representativeness heuristic.The Quarterly journal of economics, 95(3):537–557, 1980
work page 1980
-
[25]
Persuasion with motivated beliefs
David Hagmann and George Loewenstein. Persuasion with motivated beliefs. InOpinion Dynamics & Collective Decisions Workshop, 2017
work page 2017
-
[26]
Algorithmic persuasion through simulation.arXiv preprint arXiv:2311.18138, 2023
Keegan Harris, Nicole Immorlica, Brendan Lucier, and Aleksandrs Slivkins. Algorithmic persuasion through simulation.arXiv preprint arXiv:2311.18138, 2023
-
[27]
Bayesian persuasion.American Economic Review, 101(6):2590–2615, 2011
Emir Kamenica and Matthew Gentzkow. Bayesian persuasion.American Economic Review, 101(6):2590–2615, 2011
work page 2011
-
[28]
The value of knowing a demand curve: Bounds on regret for online posted-price auctions
Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pages 594–605. IEEE, 2003
work page 2003
-
[29]
Dynamic Non-Bayesian Persuasion
Masanori Kobayashi. Dynamic non-bayesian persuasion.arXiv preprint arXiv:2508.12328, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Information design with unknown prior.arXiv preprint arXiv:2410.05533, 2024
Ce Li and Tao Lin. Information design with unknown prior.arXiv preprint arXiv:2410.05533, 2024
-
[31]
What influences algorithmic decision-making? a systematic literature review on algorithm aversion
Hasan Mahmud, AKM Najmul Islam, Syed Ishtiaque Ahmed, and Kari Smolander. What influences algorithmic decision-making? a systematic literature review on algorithm aversion. Technological forecasting and social change, 175:121390, 2022
work page 2022
-
[32]
Optimal information disclosure.Journal of political Economy, 118 (5):949–987, 2010
Luis Rayo and Ilya Segal. Optimal information disclosure.Journal of political Economy, 118 (5):949–987, 2010
work page 2010
-
[33]
Motivated skepticism in the evaluation of political beliefs
Charles S Taber and Milton Lodge. Motivated skepticism in the evaluation of political beliefs. American journal of political science, 50(3):755–769, 2006
work page 2006
-
[34]
On the bayesian rational assumption in information design
Wei Tang and Chien-Ju Ho. On the bayesian rational assumption in information design. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 9, pages 120–130, 2021
work page 2021
-
[35]
Bayesian or biased? analytic thinking and political belief updating.Cognition, 204:104375, 2020
Ben M Tappin, Gordon Pennycook, and David G Rand. Bayesian or biased? analytic thinking and political belief updating.Cognition, 204:104375, 2020
work page 2020
-
[36]
Jibang Wu, Zixuan Zhang, Zhe Feng, Zhaoran Wang, Zhuoran Yang, Michael I Jordan, and Haifeng Xu. Sequential information design: Markov persuasion process and its efficient reinforcement learning.arXiv preprint arXiv:2202.10678, 2022
-
[37]
Probabilistic computations: Toward a unified measure of complexity
Andrew Chi-Chin Yao. Probabilistic computations: Toward a unified measure of complexity. In18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 222–227. IEEE Computer Society, 1977
work page 1977
-
[38]
You Zu, Krishnamurthy Iyer, and Haifeng Xu. Learning to persuade on the fly: Robustness against ignorance. InProceedings of the 22nd ACM Conference on Economics and Computation, pages 927–928, 2021. 12 A Related Work (Detailed Discussion) A large body of empirical and theoretical work shows that Bayesian updating is a useful normative benchmark but often ...
work page 2021
-
[39]
Each physical round contributes at most ∆Umax regret
Therefore E[Xk]≤ √ 2 δµ0 ,E " NX k=1 Xk # ≤ √ 2N δµ0 =O(log logT). Each physical round contributes at most ∆Umax regret. The hidden constant here is √ 2∆Umax/δµ0. C.3 Relevant Actions Recall the relevant-action set Arel = n a∈A:∃ν∈∆(Ω)thatauniquely maximizes X ω∈Ω (α⋆ν(ω)+(1−α ⋆)µ0(ω))uR(a, ω) o . The following discussion shows that without excluding thos...
-
[40]
28 Also, bλ+ ≤ 2∥r∥2 δµ0 ≤ 2pϵ δµ0
=µ 0. 28 Also, bλ+ ≤ 2∥r∥2 δµ0 ≤ 2pϵ δµ0 . For utility, write va(ν) := X ω∈Ω ν(ω)u S(a, ω). Then |va(ν)−v a(ν′)| ≤U max p |Ω| ∥ν−ν ′∥2. Therefore, |US(τ)−U S(bτ)| ≤Umax p |Ω| X ν′ i̸=νi λi∥νi −ν ′ i∥2 +U max X i |λi −bλi|+U maxbλ+ ≤U max p |Ω|pϵ+ 2U maxbλ+ ≤U max p |Ω|pϵ+ 4Umax δµ0 pϵ. This proves the claim. C.4.3 Proof of Proposition 3 Since τopt ={(λ i,...
-
[41]
The true bias is detectable in every context:α ⋆ ≥¯αmin := maxx∈X αmin(Ix)
-
[42]
The sender utilities are uniformly bounded:U X max := maxx∈X maxa,ω |uS,x(a, ω)|<∞
-
[43]
For every contextx, the analogue of Proposition 5 holds, and there exists a positive constant c= min x cx >0, ϵ= min x ϵx >0 for the first branch in Proposition 5, and a positive constantδ= min x δx >0for the second branch
-
[44]
The constants in Theorem 2 can be chosen uniformly over x∈ X , see more details in Section C.1. Equivalently, there exist positive constant δµ0 = min x∈X δµ0,x and finite constantsκ, G max, Lb satisfying κ= max x∈X κx, G max = max x∈X Gmax,x, L b = max x∈X Lb,x. Note that the above assumption naturally holds if the context setXis finite. Contextual algori...
-
[45]
2:Initialize:interval[a, b]←[0,1], step sizeε← 1 2, timet←1
the receiver takes action 1 if and only if the distorted posterior is at leastbq, where 0< µ ∗ < bq <1; 38 Algorithm 9Safe Exploration for Unknown Bias and Unknown Prior 1:Input:horizonT, thresholdbq. 2:Initialize:interval[a, b]←[0,1], step sizeε← 1 2, timet←1. 3:Stage 1: Safe exploration 4:whileb−a > T −1 andt≤Tdo 5:m←a,m prev ←a 6:whilem≤bandt≤Tdo 7:Com...
-
[46]
persuasion is feasible, namely α∗ ≥α min = bq−µ∗ 1−µ ∗ so thatm ∗ ∈[0,1]is well-defined
-
[47]
ties are broken in favor of action1. LetΠ SEJ denote Algorithm 9. Then, for everyT≥4, RegT ΠSEJ;α ∗, µ∗ =O(log logT). In particular, RegT ΠSEJ;α ∗, µ∗ ≤ 1−µ ∗ µ∗ + 1 (2 +⌈log 2 log2 T⌉) + 1. Proof. Let V ∗ denote the full-information benchmark value. As noted above, when (α⋆, µ∗) is known, an optimal scheme is supported on{0, ν ∗}, soV ∗ = µ∗ ν∗ =µ ∗ + (1...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.