An Empirical Study on the Practical Impact of Prior Beliefs over Policy Types

Jacob W. Crandall; Stefano V. Albrecht; Subramanian Ramamoorthy

arxiv: 1907.05247 · v1 · pith:EQCORJ2Inew · submitted 2019-07-10 · 💻 cs.AI · cs.MA

Pith reviewed 2026-05-24 23:56 UTC · model grok-4.3

classification 💻 cs.AI cs.MA

keywords: multiagent learning, prior beliefs, policy types, Bayesian updating, repeated interactions, planning horizon, empirical study

The pith

Prior beliefs over policy types can significantly impact long-term performance in multiagent learning algorithms, with the effect depending on planning horizon depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies Bayesian methods in which an agent maintains beliefs over the policies a partner might be using and updates them from observed actions. Its experiments show that the choice of prior belief affects how well the agent performs over many interactions. The size of this effect depends on how far ahead the agent plans: in the reported figures, deeper planning horizons diminish the impact of the prior in some settings and amplify it in others. The results also indicate that automatically computed priors have consistent performance effects, so priors might not need to be set by hand. This matters for applications where agents must adapt quickly to new partners without extensive tuning.

Core claim

Prior beliefs can have a significant impact on the long-term performance of methods that compute posterior beliefs over a hypothesised set of policies, and the magnitude of the impact depends on the depth of the planning horizon. Automatic methods can be used to compute prior beliefs with consistent performance effects, indicating that prior beliefs could be eliminated as a manual parameter and instead be computed automatically.

What carries the argument

Bayesian posterior updating over a hypothesized set of partner policies using observed actions in repeated interactions.
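
The updating step named above can be illustrated with a minimal sketch. It assumes plain Bayes' rule with a product of action likelihoods; the paper may use other posterior formulations. The names `update_belief`, `posterior_after`, and the two example types are hypothetical and not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: a "type" maps the interaction history to a probability
# distribution over the partner's actions. This interface is hypothetical.
History = Sequence[int]
PolicyType = Callable[[History], List[float]]


def update_belief(belief: Dict[str, float],
                  types: Dict[str, PolicyType],
                  history: History,
                  action: int) -> Dict[str, float]:
    """One Bayes step: weight each type by the probability it gave `action`."""
    unnormalised = {
        name: belief[name] * types[name](history)[action] for name in belief
    }
    total = sum(unnormalised.values())
    if total == 0.0:
        # No hypothesised type can explain the observation.
        raise ValueError("observed action has zero probability under every type")
    return {name: weight / total for name, weight in unnormalised.items()}


def posterior_after(prior: Dict[str, float],
                    types: Dict[str, PolicyType],
                    observed_actions: Sequence[int]) -> Dict[str, float]:
    """Start from the prior and fold in the observed actions one at a time."""
    belief, history = dict(prior), []
    for action in observed_actions:
        belief = update_belief(belief, types, history, action)
        history.append(action)
    return belief


# Two hypothetical types for a partner with two actions (0 and 1).
example_types = {
    "mostly_action_0": lambda history: [0.9, 0.1],
    "uniform": lambda history: [0.5, 0.5],
}
uniform_prior = {"mostly_action_0": 0.5, "uniform": 0.5}
posterior = posterior_after(uniform_prior, example_types, [0, 0])
```

Raising an error when every type assigns zero probability to an observation is a choice made for this sketch; it makes a mismatch between the hypothesised set and the actual partner visible instead of silently dividing by zero.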

If this is right

Different prior beliefs lead to different long-term performance levels in multiagent interactions.
The magnitude of performance differences depends on the depth of the planning horizon; deeper horizons can diminish or amplify it.
Automatic methods for computing prior beliefs can achieve consistent performance effects.
Prior beliefs may be replaced by automatic computation rather than manual specification.
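
One way to see how a prior can matter from the first move is a one-step (horizon 1) best response under a belief over types. The sketch below is an illustration under stated assumptions, not the paper's HBA algorithm: the payoff matrix, the two types, and both priors are invented for the example, and `myopic_best_response` is a hypothetical helper.

```python
from typing import Dict, List


def predicted_partner_distribution(belief: Dict[str, float],
                                   type_predictions: Dict[str, List[float]]) -> List[float]:
    """Mix each type's predicted action distribution by the belief in that type."""
    n_actions = len(next(iter(type_predictions.values())))
    return [
        sum(belief[name] * type_predictions[name][a] for name in belief)
        for a in range(n_actions)
    ]


def myopic_best_response(belief: Dict[str, float],
                         type_predictions: Dict[str, List[float]],
                         payoff: List[List[float]]) -> int:
    """Pick the own action with the highest expected payoff one step ahead.

    payoff[own_action][partner_action] is player 1's payoff (assumed layout).
    """
    mix = predicted_partner_distribution(belief, type_predictions)
    expected = [sum(p * row[a] for a, p in enumerate(mix)) for row in payoff]
    return max(range(len(expected)), key=expected.__getitem__)


# Hypothetical coordination game and types, chosen only for illustration.
payoff = [[4.0, 0.0],
          [0.0, 3.0]]
type_predictions = {
    "plays_0": [0.9, 0.1],
    "plays_1": [0.1, 0.9],
}
prior_a = {"plays_0": 0.8, "plays_1": 0.2}
prior_b = {"plays_0": 0.2, "plays_1": 0.8}

first_action_a = myopic_best_response(prior_a, type_predictions, payoff)  # 0
first_action_b = myopic_best_response(prior_b, type_predictions, payoff)  # 1
```

With the same types and payoffs, the two priors select different initial actions, which can steer the rest of the interaction. A deeper horizon would replace the one-step expectation with a lookahead over several predicted partner actions.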

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of multiagent systems could test automatic prior methods to reduce the effort of manual tuning.
Similar belief-updating methods in other AI settings might show comparable sensitivity to the choice of prior.
The dependence on priors could be checked across a wider range of interaction lengths and domains.

Load-bearing premise

The hypothesized policies are representative enough of possible partner behaviors that posterior updates based on observed actions remain informative across the tested domains and interaction lengths.

What would settle it

An experiment where performance does not vary significantly across different prior beliefs in the tested domains and interaction lengths, or where automatic prior methods fail to produce consistent effects.

Figures

Figures reproduced from arXiv: 1907.05247 by Jacob W. Crandall, Stefano V. Albrecht, Subramanian Ramamoorthy.

**Figure 1.** Prior beliefs can have significant impact on long-term performance. Plots show average payoffs of player 1 (HBA). X(h)–Y–Z format: HBA used X types and horizon h, player 2 was controlled by Y, and results are averaged over Z games. The matrix $A$ can be fed into a linear program of the form $\min_c\, c^{\top} x$ s.t. $[z, A]\,x \le 0$, with $n = |\Theta_j^*|$, $c = (1, \{0\}^n)^{\top}$, $z = (\{-1\}^n)^{\top}$, to find a vector $x = (l, p_1, \ldots,$ …

**Figure 2.** Deeper planning horizons can diminish impact of prior beliefs. Results shown for HBA with LFT types, player 2 controlled by FP, averaged over no-conflict games. h is depth of planning horizon (i.e. predicting h next actions of player 2). Plot axes: average payoff (p1) against time slice (each 5 steps); panel (a) h = 1 …

**Figure 3.** Deeper planning horizons can amplify impact of prior beliefs. Results shown for HBA with CNN types, player 2 controlled by RT, averaged over conflict games. h is depth of planning horizon (i.e. predicting h next actions of player 2). How can deeper planning horizons amplify the impact of prior beliefs? Our data show that whether or not different prior beliefs cause HBA to take different initial actions dep…

**Figure 4.** Automatic prior beliefs have consistent performance effects. Rows show prior beliefs and columns show performance criteria. Each element (r, c) in the matrix corresponds to the percentage of time slices in which the prior belief r produced significantly higher values for the criterion c than the Uniform prior, averaged over all plays in all tested games. All significance statements are based on paired rig…
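
The linear program quoted in the Figure 1 excerpt is cut off, so the following expansion is only a reading of the visible part. It assumes the minimisation runs over $x$, that the truncated vector continues to $p_n$, and that $z$ is a column so that $[z, A]$ is a block matrix. Any further constraints, such as requiring $p$ to be a probability distribution, are not visible in the excerpt and are left out.

```latex
\begin{aligned}
\min_{x}\quad & c^{\top} x = l \\
\text{s.t.}\quad & [z, A]\,x \le 0
  \;\Longleftrightarrow\;
  \sum_{k=1}^{n} A_{ik}\, p_k \le l \quad \text{for each row } i \text{ of } A, \\
\text{where}\quad & x = (l, p_1, \ldots, p_n)^{\top},\quad
  c = (1, 0, \ldots, 0)^{\top},\quad
  z = (-1, \ldots, -1)^{\top}.
\end{aligned}
```

Under this reading, $l$ is an upper bound on every row of $A p$, and minimising it amounts to minimising the largest entry of $A p$.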

Original abstract

Many multiagent applications require an agent to learn quickly how to interact with previously unknown other agents. To address this problem, researchers have studied learning algorithms which compute posterior beliefs over a hypothesised set of policies, based on the observed actions of the other agents. The posterior belief is complemented by the prior belief, which specifies the subjective likelihood of policies before any actions are observed. In this paper, we present the first comprehensive empirical study on the practical impact of prior beliefs over policies in repeated interactions. We show that prior beliefs can have a significant impact on the long-term performance of such methods, and that the magnitude of the impact depends on the depth of the planning horizon. Moreover, our results demonstrate that automatic methods can be used to compute prior beliefs with consistent performance effects. This indicates that prior beliefs could be eliminated as a manual parameter and instead be computed automatically.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Empirical evidence that priors matter in Bayesian multi-agent RL and can be automated, but hypothesis set coverage needs scrutiny.

The main point is that this paper supplies the first broad empirical data showing that prior beliefs over partner policies affect long-term performance in repeated multi-agent interactions, that the size of the effect depends on the depth of the planning horizon (the figures show deeper horizons both diminishing and amplifying it), and that automatic methods for setting those priors produce consistent outcomes. That last part suggests priors could stop being a manual knob.

The work is useful because it actually runs the experiments across horizons instead of leaving the parameter unexamined. It gives practitioners concrete evidence on when the choice matters and that automation is feasible without big performance swings.

The soft spot is exactly the one the stress-test note raises. The results rest on the hypothesized policy set being representative enough for the posterior updates to stay informative. If the sets were assembled in a narrow way, or if the paper never checks what happens when the true partner behavior falls outside them, then the measured sensitivity and the automation claim could be tied to those specific choices rather than a general feature of the Bayesian approach. The abstract gives no information on how the sets were built or whether out-of-set tests were run, so that section will need careful reading.

Beyond that, the paper looks straightforward and avoids obvious circularity. It is aimed at people working on Bayesian methods for unknown partners in multi-agent RL. A reader in that niche will get practical data points from it. The empirical contribution is solid enough to deserve referee time even if revisions are needed on the hypothesis-set details.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the first comprehensive empirical study on the practical impact of prior beliefs over a hypothesized set of policies in repeated multi-agent interactions. It claims that these priors can significantly affect long-term performance of Bayesian posterior-update methods, with the magnitude depending on planning-horizon depth, and that automatic methods can compute priors yielding consistent performance effects, suggesting priors could be automated rather than manually specified.

Significance. If the central empirical claims hold after addressing validation gaps, the work would be significant for opponent modeling and multi-agent RL: it supplies concrete evidence that prior choice is a load-bearing practical factor rather than a purely theoretical one, and demonstrates a path to removing it as a manual hyperparameter. The reproducible experimental protocol and consistent automatic-prior results are strengths that would support adoption in the field.

major comments (2)

[Experimental setup and hypothesis-set construction] The construction and coverage of the hypothesized policy sets (detailed in the experimental setup) are not validated against out-of-distribution partner behaviors or longer interaction lengths. This is load-bearing for the claim that automatic priors can replace manual ones, because if the true partner policy lies outside the set, posterior updates become uninformative and the observed prior sensitivity may be an artifact of mismatch rather than a general property.
[Main results and horizon-depth analysis] Results on the dependence of prior impact on planning-horizon depth (reported in the main results tables) do not include controls that isolate whether the effect persists when the hypothesis set is expanded or when out-of-set behaviors are injected; without such controls the horizon-depth interaction cannot be confidently attributed to the Bayesian update mechanism itself.

minor comments (2)

[Abstract and §1] The abstract and introduction use 'policy types' and 'hypothesized policies' interchangeably; a single consistent term would improve readability.
[Figures in results section] Figure captions for performance plots should explicitly state the number of independent runs and whether error bars represent standard error or deviation.
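
The distinction raised in the last minor comment can be made concrete with a short sketch. It assumes each independent run yields one scalar payoff; the function name and the sample numbers are hypothetical and are not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_sd_se(run_payoffs: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error of the mean)."""
    n = len(run_payoffs)
    if n < 2:
        raise ValueError("need at least two independent runs")
    mean = statistics.fmean(run_payoffs)
    sd = statistics.stdev(run_payoffs)  # sample standard deviation (n - 1 denominator)
    se = sd / math.sqrt(n)              # shrinks as the number of runs grows
    return mean, sd, se


# Made-up payoffs from four hypothetical runs.
mean, sd, se = mean_sd_se([2.0, 4.0, 6.0, 8.0])
```

The standard deviation describes the spread across runs, while the standard error describes the uncertainty of the reported mean, so a caption needs to say which one the error bars show and how many runs they are based on.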

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on experimental validation. We address each major point below and agree that targeted additions will strengthen the manuscript. We plan to revise accordingly.

Referee: [Experimental setup and hypothesis-set construction] The construction and coverage of the hypothesized policy sets (detailed in the experimental setup) are not validated against out-of-distribution partner behaviors or longer interaction lengths. This is load-bearing for the claim that automatic priors can replace manual ones, because if the true partner policy lies outside the set, posterior updates become uninformative and the observed prior sensitivity may be an artifact of mismatch rather than a general property.

Authors: We agree that robustness to out-of-distribution behaviors is important for the broader claim about automatic priors. The current study evaluates performance when the true policy is drawn from the hypothesized set, which is the standard setting for assessing Bayesian posterior updates over a fixed hypothesis class. To address the concern, the revised manuscript will add experiments injecting out-of-distribution partner policies (e.g., hand-crafted policies outside the set) and longer interaction horizons, reporting whether prior sensitivity and automatic-prior consistency persist under mismatch. revision: yes
Referee: [Main results and horizon-depth analysis] Results on the dependence of prior impact on planning-horizon depth (reported in the main results tables) do not include controls that isolate whether the effect persists when the hypothesis set is expanded or when out-of-set behaviors are injected; without such controls the horizon-depth interaction cannot be confidently attributed to the Bayesian update mechanism itself.

Authors: We acknowledge the value of these controls for isolating the mechanism. The reported horizon-depth interaction is observed under the fixed hypothesis sets used throughout the paper. In revision we will add ablations that (i) expand the hypothesis set size and (ii) inject out-of-set behaviors, then re-evaluate whether the dependence of prior impact on planning depth remains. These results will be reported alongside the original tables. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivation chain

The paper is an empirical study reporting experimental results on the impact of prior beliefs in multi-agent policy learning. No equations, derivations, or fitted parameters are presented that reduce by construction to author-defined inputs. Claims rest on observed performance differences across tested domains and horizons. No self-citation bears the weight of the central result, and no ansatz is smuggled in via prior work. The representativeness assumption is a standard empirical limitation but does not create circularity in any claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical study in multi-agent learning and introduces no new free parameters, axioms beyond standard domain assumptions, or invented entities in the provided abstract.

axioms (1)

domain assumption: Agents compute posterior beliefs over a hypothesized set of policies based on observed actions of other agents.
This is the core mechanism described in the abstract for the learning algorithms under study.

pith-pipeline@v0.9.0 · 5681 in / 1234 out tokens · 25501 ms · 2026-05-24T23:56:41.815574+00:00 · methodology

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1] Albrecht, S., and Ramamoorthy, S. 2012. Comparative evaluation of MAL algorithms in a diverse set of ad hoc team problems. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, volume 1, 349–356.
[2] Albrecht, S., and Ramamoorthy, S. 2013. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems (extended abstract). In Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems, 1155–1156.
[3] Albrecht, S., and Ramamoorthy, S. 2014. On convergence and optimality of best-response learning with policy types in multiagent systems. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, 12–21.
[4] Albrecht, S.; Crandall, J.; and Ramamoorthy, S. 2015. An empirical study on the practical impact of prior beliefs over policy types – Appendix. http://rad.inf.ed.ac.uk/data/publications/2015/aaai15app.pdf
[5] Barrett, S.; Stone, P.; and Kraus, S. 2011. Empirical evaluation of ad hoc teamwork in the pursuit domain. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems, volume 2, 567–574.
[6] Bernardo, J. 1979. Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society. Series B (Methodological) 41(2):113–147.
[7] Bowling, M., and McCracken, P. 2005. Coordination and adaptation in impromptu teams. In Proceedings of the 20th National Conference on Artificial Intelligence, volume 1, 53–58.
[8] Brafman, R., and Tennenholtz, M. 2003. R-max – A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3:213–231.
[9] Brown, G. 1951. Iterative solution of games by fictitious play. Activity analysis of production and allocation 13(1):374–376.
[10] Carberry, S. 2001. Techniques for plan recognition. User Modeling and User-Adapted Interaction 11(1-2):31–48.
[11] Carmel, D., and Markovitch, S. 1999. Exploration strategies for model-based learning in multi-agent systems: Exploration strategies. Autonomous Agents and Multi-Agent Systems 2(2):141–172.
[12] Charniak, E., and Goldman, R. 1993. A Bayesian model of plan recognition. Artificial Intelligence 64(1):53–79.
[13] Claus, C., and Boutilier, C. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th National Conference on Artificial Intelligence, 746–752.
[14] Cox, J.; Shachat, J.; and Walker, M. 2001. An experiment to evaluate Bayesian learning of Nash equilibrium play. Games and Economic Behavior 34(1):11–33.
[15] Crandall, J. 2014. Towards minimizing disappointment in repeated games. Journal of Artificial Intelligence Research 49:111–142.
[16] De Finetti, B. 2008. Philosophical Lectures on Probability: collected, edited, and annotated by Alberto Mura. Springer.
[17] Dekel, E.; Fudenberg, D.; and Levine, D. 2004. Learning to play Bayesian games. Games and Economic Behavior 46(2):282–303.
[18] Gmytrasiewicz, P., and Doshi, P. 2005. A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research 24(1):49–79.
[19] Harsanyi, J. 1967. Games with incomplete information played by “Bayesian” players. Part I. The basic model. Management Science 14(3):159–182.
[20] Holland, J. 1975. Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. The MIT Press.
[21] Jaynes, E. 1968. Prior probabilities. IEEE Transactions on Systems Science and Cybernetics 4(3):227–241.
[22] Jordan, J. 1991. Bayesian learning in normal form games. Games and Economic Behavior 3(1):60–81.
[23] Kalai, E., and Lehrer, E. 1993. Rational learning leads to Nash equilibrium. Econometrica 61(5):1019–1045.
[24] Koza, J. 1992. Genetic programming: On the programming of computers by means of natural selection. The MIT Press.
[25] Rapoport, A., and Guyer, M. 1966. A taxonomy of 2×2 games. General Systems: Yearbook of the Society for General Systems Research 11:203–214.
