Reasoning about Hypothetical Agent Behaviours and their Parameters

Peter Stone; Stefano V. Albrecht

arxiv: 1906.11064 · v1 · pith:3RZYNJINnew · submitted 2019-06-26 · 💻 cs.MA


Pith reviewed 2026-05-25 15:05 UTC · model grok-4.3

classification 💻 cs.MA

keywords hypothetical agent behaviours, type-based reasoning, parameter estimation, multi-agent interaction, belief maintenance, continuous parameters


The pith

Agents can maintain separate estimates for bounded continuous parameters inside each hypothetical behavior type.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Previous approaches to agent interaction with unknown others treat hypothetical behaviors, or types, as black-box mappings from histories to action distributions and therefore ignore any internal parameters. This work proposes a method that keeps an individual running estimate of every bounded continuous parameter belonging to each type and updates the estimates for a chosen subset of types after every new observation. Several concrete rules for choosing which types to update and how to revise the estimates are defined and tested. Experiments show that restricting the update to the parameters of only a single type per observation is frequently enough to reach strong performance.

Core claim

The proposed general method allows an agent to reason about both the relative likelihood of types and the values of any bounded continuous parameters within types by maintaining individual parameter estimates for each type and selectively updating the estimates for some types after each observation.
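The type-likelihood half of this claim can be illustrated with a minimal sketch, assuming a standard Bayesian posterior update over a finite set of types. The function name `update_type_beliefs` and the dictionary layout are illustrative assumptions, not taken from the paper; each likelihood is assumed to be the probability a type assigns to the observed action under its current parameter estimate.

```python
def update_type_beliefs(beliefs, likelihoods):
    """Return the posterior over types after one observed action.

    beliefs:     dict mapping type name -> prior probability (assumed layout)
    likelihoods: dict mapping type name -> probability that the type, using its
                 current parameter estimate, assigns to the observed action
    """
    unnormalised = {t: beliefs[t] * likelihoods[t] for t in beliefs}
    total = sum(unnormalised.values())
    if total == 0.0:
        # No type explains the observation; keep the prior instead of dividing by zero.
        return dict(beliefs)
    return {t: weight / total for t, weight in unnormalised.items()}
```

Because the likelihood of each type is evaluated at that type's own parameter estimate, a poor estimate can depress the belief in the true type, which is why the estimates need to be revised alongside the beliefs.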

What carries the argument

Per-type maintenance and selective updating of bounded continuous parameter estimates, performed independently for each hypothetical type after new observations.
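As a rough illustration of revising one type's estimate, the sketch below grid-searches a bounded interval for the value that maximises a log-likelihood. Grid search is a stand-in chosen here for simplicity and is not claimed to be one of the paper's estimators; `estimate_parameter` and the toy `bernoulli_log_likelihood` type are hypothetical names.

```python
import math


def estimate_parameter(log_likelihood, lower, upper, grid_size=101):
    """Return the grid point in [lower, upper] that maximises log_likelihood(p)."""
    if grid_size < 2:
        raise ValueError("grid_size must be at least 2")
    best_p, best_ll = lower, -math.inf
    for k in range(grid_size):
        p = lower + (upper - lower) * k / (grid_size - 1)
        ll = log_likelihood(p)
        if ll > best_ll:
            best_p, best_ll = p, ll
    return best_p


def bernoulli_log_likelihood(observations):
    """Log-likelihood for a toy type that takes action 1 with probability p."""
    ones = sum(observations)
    zeros = len(observations) - ones

    def ll(p):
        eps = 1e-12  # keep log() finite at the interval boundaries
        return ones * math.log(max(p, eps)) + zeros * math.log(max(1.0 - p, eps))

    return ll
```

The search never leaves the interval, so the estimate stays bounded by construction. With observations `[1, 1, 1, 0]` and bounds `[0, 1]` the estimate is 0.75; narrowing the bounds to `[0, 0.5]` moves it to the upper bound, 0.5.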

If this is right

Agents can now treat type specifications that contain continuous parameters without discarding the parameter information.
Updating parameter estimates for only one type after each observation can still produce good interaction performance.
Several different rules for selecting which types to update and how to revise their estimates remain viable.
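To make the idea of a selection rule concrete, here is a sketch of three simple candidates for choosing the single type whose parameters are updated after an observation. These rules are assumptions made for illustration; the summary above says only that several rules are defined, not which ones.

```python
import random


def select_uniformly(beliefs, rng):
    """Pick any type with equal probability, ignoring the current beliefs."""
    return rng.choice(sorted(beliefs))


def select_by_posterior_sampling(beliefs, rng):
    """Pick a type with probability equal to its current belief."""
    types = sorted(beliefs)
    return rng.choices(types, weights=[beliefs[t] for t in types], k=1)[0]


def select_most_likely(beliefs):
    """Pick the type with the highest current belief."""
    return max(sorted(beliefs), key=lambda t: beliefs[t])
```

Each function takes a dictionary from type name to belief; the two randomised rules also take a `random.Random` instance so that runs can be seeded and repeated.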

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may lower computational cost in settings where full joint inference over types and parameters would otherwise be required.
It could be tested in physical robot domains where parameters such as speed or sensor noise are bounded but initially unknown.

Load-bearing premise

Type specifications contain identifiable bounded continuous parameters whose values can be estimated separately for each type without joint inference over the full space or access to the true generative model.
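One way to picture this premise is a type represented as a named policy plus a list of bounded parameters, each with its own running estimate. The class below is a hypothetical data structure for illustration only; the field names, the history representation, and the midpoint initialisation are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

History = Sequence[str]        # assumed: a history is a sequence of action labels
ActionDist = Dict[str, float]  # action label -> probability


@dataclass
class ParameterisedType:
    name: str
    bounds: List[Tuple[float, float]]
    policy: Callable[[History, Sequence[float]], ActionDist]
    estimate: List[float] = field(default_factory=list)

    def __post_init__(self):
        if not self.estimate:
            # Start every estimate at the midpoint of its interval (arbitrary choice).
            self.estimate = [(lo + hi) / 2.0 for lo, hi in self.bounds]
        if len(self.estimate) != len(self.bounds):
            raise ValueError("need exactly one estimate per bounded parameter")
        for value, (lo, hi) in zip(self.estimate, self.bounds):
            if not lo <= value <= hi:
                raise ValueError(f"estimate {value} outside [{lo}, {hi}]")

    def action_probabilities(self, history: History) -> ActionDist:
        """Evaluate the type's policy at its current parameter estimate."""
        return self.policy(history, self.estimate)
```

Keeping the estimate inside the type object means each type can be updated separately, without any joint structure over the full type-parameter space.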

What would settle it

An experiment in which restricting updates to a single type's parameters after each observation produces measurably worse performance than methods that perform joint type-parameter inference.

Figures

Figures reproduced from arXiv: 1906.11064 by Peter Stone, Stefano V. Albrecht.

**Figure 3.** Time steps required in completed instances (means …

**Figure 4.** Average seconds (log-scale) needed per parameter …

**Figure 5.** Mean error in parameter estimates for the true type …

**Figure 6.** Average belief $P(\theta_j^* \mid H_i^t)$ for the true type $\theta_j^*$ in the 10x10 world (updating all types in each time step). Probabilities are averaged over 500 instances and shown for the first 10 and last time steps of an instance.

Original abstract

Agents can achieve effective interaction with previously unknown other agents by maintaining beliefs over a set of hypothetical behaviours, or types, that these agents may have. A current limitation in this method is that it does not recognise parameters within type specifications, because types are viewed as blackbox mappings from interaction histories to probability distributions over actions. In this work, we propose a general method which allows an agent to reason about both the relative likelihood of types and the values of any bounded continuous parameters within types. The method maintains individual parameter estimates for each type and selectively updates the estimates for some types after each observation. We propose different methods for the selection of types and the estimation of parameter values. The proposed methods are evaluated in detailed experiments, showing that updating the parameter estimates of a single type after each observation can be sufficient to achieve good performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper gives a clean incremental fix for handling bounded continuous parameters inside agent types without forcing joint inference over everything.


The core advance is a selective-update scheme that maintains separate parameter estimates per type and only refreshes some of them after each observation. Experiments indicate that updating a single type per step is often enough for solid performance. That directly tackles the black-box limitation the abstract flags in earlier type-based methods, where parameters were simply ignored because types were treated as fixed mappings from histories to action distributions. The work stays scoped to bounded continuous parameters that are identifiable within given type specs, which keeps the claim manageable. No load-bearing circularity shows up in the description, and the stress-test note finds no internal inconsistency. The main soft spot is that the abstract alone leaves the exact selection and estimation procedures underspecified, so the practical robustness depends on details that need the full text and code to judge. Still, the central claim holds up on its own terms. This is for people already working on opponent modeling or type-based reasoning in multi-agent systems; it is a useful but narrow extension rather than a broad shift. A serious editor should send it to referees because the idea is well-motivated, the experiments are reported as positive, and the limitation it addresses is real in the cited prior literature.

Referee Report

0 major / 3 minor

Summary. The paper proposes a general method allowing agents to reason about hypothetical behaviors (types) of other agents, including any bounded continuous parameters within those type specifications. Types are no longer treated as black-box mappings; instead, the method maintains separate parameter estimates for each type and selectively updates a subset of them after each observation. Several selection and estimation procedures are introduced and evaluated experimentally, with the key finding that updating parameters for only a single type per observation is often sufficient to achieve good performance.

Significance. If the experimental results hold under the stated conditions, the work removes a practical limitation of type-based opponent modeling by enabling parameter inference inside types without requiring joint inference over the full type-parameter space. The demonstration that single-type updates suffice is a concrete, falsifiable contribution that could improve scalability in multi-agent interaction settings.

minor comments (3)

[Abstract / §1] The abstract and introduction would benefit from a short, explicit statement of the boundedness assumption on parameters and how it is enforced in the estimation procedures (e.g., projection or truncation).
[§3] Notation for the per-type parameter estimates (e.g., θ_i for type i) should be introduced once and used consistently; several passages appear to switch between “parameter vector” and “parameter value” without clarification.
[§5] The experimental section would be strengthened by reporting the number of independent runs and any statistical tests used to support the claim that single-type updating “achieves good performance.”
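Two of these comments can be made concrete with small helpers. The sketch below shows projection of an estimate onto its interval, one possible way to enforce boundedness as raised in the first comment, and a mean with standard error over independent runs, as requested in the third. Both function names are hypothetical, and nothing here is claimed to match what the paper does.

```python
import math
import statistics


def project_to_bounds(value, lower, upper):
    """Clip a parameter estimate onto the closed interval [lower, upper]."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return min(max(value, lower), upper)


def mean_and_standard_error(run_results):
    """Summarise one metric over independent runs as (mean, standard error)."""
    if len(run_results) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.mean(run_results)
    standard_error = statistics.stdev(run_results) / math.sqrt(len(run_results))
    return mean, standard_error
```

Reporting the standard error next to each mean, together with the number of runs, would let readers judge whether differences between update rules exceed run-to-run variation.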

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report correctly identifies the core contribution of maintaining per-type parameter estimates with selective updates. No major comments were enumerated in the report, so we have no specific points to address point-by-point. We will incorporate any minor suggestions during revision.

Circularity Check

0 steps flagged

No significant circularity detected


The paper proposes a practical method for per-type parameter estimation and selective updating, evaluated via experiments. Nothing in the abstract or the described claims (no equation, derivation, or load-bearing step) reduces by construction to fitted inputs, self-citations, or ansatzes. The central claim remains independent and scoped to bounded parameters within supplied type specifications, with no reduction to its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes types can be decomposed into discrete identity plus continuous parameters.



Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1] S. Albrecht, J. Crandall, and S. Ramamoorthy. Belief and truth in hypothesised behaviours. Artificial Intelligence, 235:63–94, 2016
[2] S. Albrecht, S. Liemhetcharat, and P. Stone. Special issue on multiagent interaction without prior coordination: Guest editorial. Autonomous Agents and Multi-Agent Systems, 2016
[3] S. Albrecht and S. Ramamoorthy. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. Technical report, School of Informatics, The University of Edinburgh, 2013
[4] S. Albrecht and S. Ramamoorthy. On convergence and optimality of best-response learning with policy types in multiagent systems. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pages 12–21, 2014
[5] S. Albrecht and S. Ramamoorthy. Exploiting causality for selective belief filtering in dynamic Bayesian networks. Journal of Artificial Intelligence Research, 55:1135–1178, 2016
[6] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002
[7] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Symposium on the Foundations of Computer Science, pages 322–331, 1995
[8] S. Barrett and P. Stone. Cooperating with unknown teammates in complex domains: a robot soccer case study of ad hoc teamwork. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 2010–2016, 2015
[9] S. Barrett, P. Stone, and S. Kraus. Empirical evaluation of ad hoc teamwork in the pursuit domain. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems, pages 567–574, 2011
[10] S. Barrett, P. Stone, S. Kraus, and A. Rosenfeld. Teamwork with limited knowledge of teammates. In Proceedings of the 27th AAAI Conference on Artificial Intelligence, pages 102–108, 2013
[11] M. Bowling and P. McCracken. Coordination and adaptation in impromptu teams. In Proceedings of the 20th National Conference on Artificial Intelligence, pages 53–58, 2005
[12] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pages 33–42, 1998
[13] D. Carmel and S. Markovitch. Learning models of intelligent agents. In Proceedings of the 13th National Conference on Artificial Intelligence, pages 62–67, 1996
[14] D. Carmel and S. Markovitch. Exploration strategies for model-based learning in multi-agent systems. Autonomous Agents and Multi-Agent Systems, 2(2):141–172, 1999
[15] G. Chalkiadakis and C. Boutilier. Coordination in multiagent reinforcement learning: a Bayesian approach. In Proceedings of the 2nd International Conference on Autonomous Agents and Multiagent Systems, pages 709–716, 2003
[16] M. Chandrasekaran, P. Doshi, Y. Zeng, and Y. Chen. Team behavior in interactive dynamic influence diagrams with applications to ad hoc teams. In Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems, pages 1559–1560, 2014
[17] D. de Farias and N. Megiddo. Exploration-exploitation tradeoffs for experts algorithms in reactive environments. In Advances in Neural Information Processing Systems 17, pages 409–416, 2004
[18] P. Doshi and P. Gmytrasiewicz. On the difficulty of achieving equilibrium in interactive POMDPs. In Proceedings of the 21st National Conference on Artificial Intelligence, pages 1131–1136, 2006
[19] B. Fu. Multivariate polynomial integration and differentiation are polynomial time inapproximable unless P = NP. In Lecture Notes in Computer Science, volume 7285, pages 182–191. Springer, 2012
[20] P. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research, 24(1):49–79, 2005
[21] P. Hart, N. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. In IEEE Transactions on Systems Science and Cybernetics, volume 4, pages 100–107, July 1968
[22] R. Horst, P. Pardalos, and N. Thoai. Introduction to Global Optimization. Kluwer Academic Publishers, 2000
[23] E. Kalai and E. Lehrer. Rational learning leads to Nash equilibrium. Econometrica, 61(5):1019–1045, 1993
[24] E. Kalai and E. Lehrer. Weak and strong merging of opinions. Journal of Mathematical Economics, 23:73–86, 1994
[25] R. Karandikar, D. Mookherjee, D. Ray, and F. Vega-Redondo. Evolving aspirations and cooperation. Journal of Economic Theory, 80(2):292–331, 1998
[26] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, pages 282–293. Springer, 2006
[27] A. Ledezma, R. Aler, A. Sanchis, and D. Borrajo. Predicting opponent actions by observation. In RoboCup 2003: Robot Soccer World Cup VII, pages 286–296. Springer, 2004
[28] R. Martinez-Cantin. BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits. Journal of Machine Learning Research, 15:3735–3739, 2014
[29] J. Mockus. Bayesian approach to global optimization: theory and applications. Springer Science & Business Media, 2013
[30] K. Murphy and Y. Weiss. The factored frontier algorithm for approximate inference in DBNs. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 378–385, 2001
[31] J. Nachbar. Beliefs in repeated games. Econometrica, 73(2):459–480, 2005
[32] A. Panella and P. Gmytrasiewicz. Interactive POMDPs with finite-state models of other agents. Autonomous Agents and Multi-Agent Systems, 2017
[33] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006
[34] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952
[35] J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012
[36] F. Southey, M. Bowling, B. Larson, C. Piccione, N. Burch, D. Billings, and C. Rayner. Bayes’ bluff: opponent modelling in poker. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pages 550–558, 2005
[37] P. Stone, G. Kaminka, S. Kraus, and J. Rosenschein. Ad hoc autonomous agent teams: collaboration without pre-coordination. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, pages 1504–1509, 2010
[38] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998
[39] W. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933
[40] C. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992
