Action-Gradient Monte Carlo Tree Search for Non-Parametric Continuous (PO)MDPs

Idan Lev-Yehudi; Michael Novitsky; Moran Barenboim; Ron Benchetrit; Vadim Indelman

arxiv: 2503.12181 · v4 · pith:B3HWXT5Bnew · submitted 2025-03-15 · 💻 cs.AI · cs.RO

Action-Gradient Monte Carlo Tree Search for Non-Parametric Continuous (PO)MDPs

Idan Lev-Yehudi , Michael Novitsky , Moran Barenboim , Ron Benchetrit , Vadim Indelman This is my paper

Pith reviewed 2026-05-23 00:05 UTC · model grok-4.3

classification 💻 cs.AI cs.RO

keywords Monte Carlo Tree SearchContinuous POMDPsGradient-based optimizationImportance samplingParticle belief statesOnline planningAction refinement

0 comments

The pith

AGMCTS combines global tree search with local gradient-based action refinement for continuous (PO)MDPs while keeping value estimates consistent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Action-Gradient Monte Carlo Tree Search to address planning in continuous state, action, and observation spaces. It merges global sampling in a tree with local gradient-based improvements to actions. Three theoretical pieces support this: a gradient theorem for particle beliefs, a multiple importance sampling tree for sample reuse, and use of the Area Formula for gradients. The result is a solver that scales better than pure sampling approaches in complex continuous environments. Sympathetic readers would care because it suggests a path to more efficient online planning for autonomous systems.

Core claim

AGMCTS combines global tree search with local gradient-based action refinement while maintaining consistent value estimates, supported by an action score gradient theorem for particle belief states, the Multiple Importance Sampling Tree, and tractable gradients via the Area Formula.

What carries the argument

The Multiple Importance Sampling Tree that supports frequent action-branch updates by reusing prior samples without introducing estimator drift.

If this is right

Enables frequent action-branch updates in the search tree using reused samples.
Provides tractable action score gradients for smooth generative models.
Maintains consistent value estimates across action refinements.
Outperforms state-of-the-art sample-based solvers on multiple continuous MDP and POMDP benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The sample-reuse mechanism could apply to other sampling-based planners that face frequent action updates.
Gradient refinement may lower sample requirements in higher-dimensional continuous spaces than those tested.
Deployment on physical systems with real sensor noise would test whether the particle-belief gradients remain effective outside simulation.

Load-bearing premise

The Multiple Importance Sampling Tree re-uses prior samples for frequent action-branch updates without introducing estimator drift.

What would settle it

A demonstration of estimator drift when reusing samples under the MIS Tree weighting scheme, or AGMCTS failing to outperform sample-based baselines on the continuous MDP and POMDP benchmarks, would challenge the central claim.

Figures

Figures reproduced from arXiv: 2503.12181 by Idan Lev-Yehudi, Michael Novitsky, Moran Barenboim, Ron Benchetrit, Vadim Indelman.

**Figure 1.** Figure 1: The different update operations in the MIS tree from left to right: (1) Action update, a novel tree operation, in which a new action is connected to previous children states with updated action-value; (2) Action backpropagation; (3) State backpropagation. the value function at each state node. We define it as a weighted mean, based on invariants that the visitation counts satisfy during MCTS: Vˆ (s) ≜ n(s)… view at source ↗

**Figure 2.** Figure 2: 2D-Continuous LightDark POMDP. The agent (red dot) starts randomly on the red circle, localizes near the beacon (green star) and then aims for the goal (small blue circle). The reward function is indicated by the background heatmap. The chosen action’s optimization trajectory of AGMCTS is plotted in darkening shades of blue arrows. In this domain, S = O = R N . The agent starts at a random position on … view at source ↗

**Figure 3.** Figure 3: Performance trends of AGMCTS, DPW, PFT-DPW and POMCPOW in the light-dark, mountain/hill car, and lunar lander domains. The returns are averaged over 1000 seeds, with shaded area showing ±3 standard errors. The x-axis shows the simulations budget per planning session. is given by additive Gaussian noise to the output of a deterministic simulator. The action space is very large, and failure rewards are spars… view at source ↗

**Figure 4.** Figure 4: AGMCTS at 2-Continuous Light-Dark. The agent’s current state is the red dot, the current belief particles in orange, the next state in black, and the next belief particles in gray, the next observation is the green dot. The goal is the blue ring centered at (0, 2.5), b0 is the red ring centered at (0, 0), and the beacon is the green star at (2.5, 0). The reward function for the posterior state is the blue-… view at source ↗

read the original abstract

Online planning in continuous state, action, and observation spaces remains challenging for autonomous systems. While Monte Carlo Tree Search (MCTS) scales effectively via sampling, most continuous (PO)MDP solvers do not exploit gradient-based action optimization. We propose Action-Gradient MCTS (AGMCTS), a framework that combines global tree search with local gradient-based action refinement, while maintaining consistent value estimates. We provide three key theoretical contributions: (1) an action score gradient theorem for particle belief states; (2) the Multiple Importance Sampling (MIS) Tree that supports frequent action-branch updates by reusing prior samples without introducing estimator drift; and (3) tractable action score gradients for smooth generative models using the Area Formula. Empirical results demonstrate that AGMCTS outperforms state-of-the-art sample-based solvers in multiple challenging continuous MDP and POMDP benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AGMCTS adds action gradients and an MIS tree to MCTS for continuous POMDPs, but the no-drift claim on sample reuse needs the weighting details to hold.

read the letter

The paper's core move is AGMCTS: it runs standard MCTS over particle beliefs but then refines actions locally with gradients while trying to keep the tree values consistent. The three named pieces are an action-score gradient theorem for particles, the Multiple Importance Sampling Tree that reuses old samples when branches change, and Area Formula gradients for smooth generative models. Those look like the actual new bits rather than routine extensions of existing MCTS or gradient POMDP work. The empirical section claims wins on several continuous MDP and POMDP benchmarks, which is the practical hook for robotics planning. The hybrid global-plus-local idea is reasonable and addresses a real scaling pain point when actions are continuous. The main soft spot is exactly the one the stress test flags. The MIS Tree only stays unbiased if the importance weights perfectly correct for every action-branch update; any mismatch when the particle belief or the refined action shifts the sampling distribution would create drift and break the consistency between tree values and the gradient steps. The abstract states the property but does not show the weighting scheme or the proof, so the full paper has to deliver that without hidden assumptions on the generative model staying fixed. Minor issues include the usual need for error bars and clearer benchmark descriptions, but those are fixable. This is squarely for people who already work on online planning in continuous state-action spaces. It is technically grounded enough and novel enough in its combination to deserve a serious referee rather than a desk reject, even if the MIS weighting turns out to need extra work.

Referee Report

2 major / 2 minor

Summary. The paper proposes Action-Gradient Monte Carlo Tree Search (AGMCTS) for online planning in continuous-state, continuous-action, continuous-observation (PO)MDPs. It combines global tree search with local gradient-based action refinement while preserving consistent value estimates, via three claimed theoretical results: (1) an action-score gradient theorem for particle belief states, (2) a Multiple Importance Sampling (MIS) Tree that re-uses prior samples for frequent action-branch updates without estimator drift, and (3) tractable gradients obtained from the Area Formula for smooth generative models. Empirical results are reported to show outperformance versus prior sample-based solvers on several continuous MDP and POMDP benchmarks.

Significance. If the central consistency claim holds, the work would provide a principled way to inject gradient information into MCTS without sacrificing the unbiasedness properties required for value estimates in non-parametric continuous settings. The combination of global search and local refinement, together with the sample-reuse mechanism, could improve sample efficiency in high-dimensional planning problems where pure sampling or pure gradient methods each have known limitations.

major comments (2)

[Multiple Importance Sampling Tree] The Multiple Importance Sampling Tree section: the central claim that prior samples can be re-used for frequent action-branch updates without introducing estimator drift is load-bearing for the consistency guarantee between tree values and action-score gradients. The manuscript must supply the explicit importance-weighting scheme and a derivation showing that the weights exactly compensate for the change in the action-branch sampling distribution after each gradient refinement step; without this, any alteration in the effective proposal (due to particle belief updates or refined actions) risks bias.
[Action score gradient theorem] Action score gradient theorem section (particle belief states): the theorem is stated to support consistent gradients, yet the manuscript provides no explicit statement of the conditions under which the particle approximation preserves the gradient unbiasedness when the generative model is only approximately known or when the belief support changes.

minor comments (2)

[Empirical results] The abstract and empirical section omit error-bar reporting, number of independent runs, and precise benchmark descriptions; these details are required to assess whether the reported outperformance is robust.
[Tractable gradients via Area Formula] Notation for the Area Formula application should be cross-referenced to the precise statement of the generative model smoothness assumptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the requested details and clarifications.

read point-by-point responses

Referee: [Multiple Importance Sampling Tree] The Multiple Importance Sampling Tree section: the central claim that prior samples can be re-used for frequent action-branch updates without introducing estimator drift is load-bearing for the consistency guarantee between tree values and action-score gradients. The manuscript must supply the explicit importance-weighting scheme and a derivation showing that the weights exactly compensate for the change in the action-branch sampling distribution after each gradient refinement step; without this, any alteration in the effective proposal (due to particle belief updates or refined actions) risks bias.

Authors: We agree that an explicit importance-weighting scheme and derivation are essential to substantiate the unbiasedness claim. The current manuscript describes the MIS Tree mechanism at a high level but does not include the full weighting formulas or the step-by-step proof that the weights compensate for post-refinement distribution shifts. In the revised manuscript we will add a dedicated subsection with the precise importance weights (derived from the ratio of the original and updated action-branch proposals) and the accompanying derivation showing that the combined estimator remains unbiased under the stated assumptions. revision: yes
Referee: [Action score gradient theorem] Action score gradient theorem section (particle belief states): the theorem is stated to support consistent gradients, yet the manuscript provides no explicit statement of the conditions under which the particle approximation preserves the gradient unbiasedness when the generative model is only approximately known or when the belief support changes.

Authors: We acknowledge that the theorem statement would benefit from an explicit list of assumptions. The current version assumes an exact generative model and a fixed particle support for the belief; under these conditions the particle-based gradient estimator is unbiased. In the revision we will add a paragraph stating these assumptions clearly, together with a brief discussion of the bias that can arise when the model is only approximate or when particle support changes (e.g., via resampling), including a reference to standard particle-filter convergence results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on stated theorems without reduction to inputs

full rationale

The provided abstract and context present three theoretical contributions—an action score gradient theorem, the MIS Tree for sample reuse without drift, and tractable gradients via the Area Formula—without any equations, fitted parameters, or self-citations that reduce the claimed results to their own inputs by construction. No self-definitional patterns, fitted-input predictions, or load-bearing self-citations appear in the text. The derivation chain is therefore self-contained against external benchmarks, with the central consistency claim depending on the validity of the weighting scheme rather than tautological redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; all listed contributions rest on standard Monte Carlo sampling assumptions and the existence of smooth generative models.

pith-pipeline@v0.9.0 · 5692 in / 1139 out tokens · 28972 ms · 2026-05-23T00:05:07.332925+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose Action-Gradient MCTS (AGMCTS), a framework that combines global tree search with local gradient-based action refinement, while maintaining consistent value estimates. We provide three key theoretical contributions: (1) an action score gradient theorem for particle belief states; (2) the Multiple Importance Sampling (MIS) Tree...
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 2. If the weighting strategy wi satisfies assumption (MIS-I) and (MIS-II), then the MIS tree yields unbiased estimates...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 2 internal anchors

[1]

Continuous upper confidence trees

Adrien Couëtoux, Jean-Baptiste Hoock, Nataliya Sokolovska, Olivier Teytaud, and Nicolas Bonnard. Continuous upper confidence trees. In Learning and Intelligent Optimization: 5th International Conference, LION 5, Rome, Italy, January 17-21, 2011. Selected Papers 5, pages 433–445. Springer, 2011

work page 2011
[2]

Online algorithms for pomdps with continuous state, action, and observation spaces

Zachary Sunberg and Mykel Kochenderfer. Online algorithms for pomdps with continuous state, action, and observation spaces. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 28, 2018

work page 2018
[3]

Monte carlo tree search in continuous spaces using voronoi optimistic optimization with regret bounds

Beomjoon Kim, Kyungjae Lee, Sungbin Lim, Leslie Kaelbling, and Tomás Lozano-Pérez. Monte carlo tree search in continuous spaces using voronoi optimistic optimization with regret bounds. In AAAI Conf. on Artificial Intelligence, volume 34, pages 9916–9924, 2020

work page 2020
[4]

V oronoi progressive widening: efficient online solvers for continuous state, action, and observation pomdps

Michael H Lim, Claire J Tomlin, and Zachary N Sunberg. V oronoi progressive widening: efficient online solvers for continuous state, action, and observation pomdps. In2021 60th IEEE conference on decision and control (CDC), pages 4493–4500. IEEE, 2021

work page 2021
[5]

Adaptive discretization using voronoi trees for continuous pomdps

Marcus Hoerger, Hanna Kurniawati, Dirk Kroese, and Nan Ye. Adaptive discretization using voronoi trees for continuous pomdps. The International Journal of Robotics Research, 43(9): 1283–1298, 2024

work page 2024
[6]

Monte carlo tree search in continuous action spaces with execution uncertainty

Timothy Yee, Viliam Lis`y, Michael H Bowling, and S Kambhampati. Monte carlo tree search in continuous action spaces with execution uncertainty. In IJCAI, pages 690–697, 2016

work page 2016
[7]

Continuous state-action-observation pomdps for trajectory planning with bayesian optimisation

Philippe Morere, Roman Marchant, and Fabio Ramos. Continuous state-action-observation pomdps for trajectory planning with bayesian optimisation. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 8779–8786. IEEE, 2018

work page 2018
[8]

Bayesian optimized monte carlo planning

John Mern, Anil Yildiz, Zachary Sunberg, Tapan Mukerji, and Mykel J Kochenderfer. Bayesian optimized monte carlo planning. In Proceedings of the AAAI Conference on Artificial Intelli- gence, volume 35, pages 11880–11887, 2021

work page 2021
[9]

Monte-carlo tree search in continuous action spaces with value gradients

Jongmin Lee, Wonseok Jeon, Geon-Hyeong Kim, and Kee-Eung Kim. Monte-carlo tree search in continuous action spaces with value gradients. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 4561–4568, 2020

work page 2020
[10]

Policy gradient meth- ods for reinforcement learning with function approximation

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient meth- ods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999

work page 1999
[11]

Trust region policy optimization of pomdps

Kamyar Azizzadenesheli, Yisong Yue, and Animashree Anandkumar. Policy gradient in partially observable environments: Approximation and convergence. arXiv preprint arXiv:1810.07900, 2018

work page arXiv 2018
[12]

A policy gradient method for confounded POMDPs

Mao Hong, Zhengling Qi, and Yanxun Xu. A policy gradient method for confounded POMDPs. In The Twelfth International Conference on Learning Representations , 2024. URL https: //openreview.net/forum?id=8BAkNCqpGW. x

work page 2024
[13]

Monte-carlo graph search: the value of merging similar states

Edouard Leurent and Odalric-Ambrym Maillard. Monte-carlo graph search: the value of merging similar states. In Sinno Jialin Pan and Masashi Sugiyama, editors, Asian Conference on Machine Learning (ACML 2020), pages 577 – 592, Bangkok, Thailand, November 18-20 2020

work page 2020
[14]

Previous knowledge utilization in online anytime belief space planning

Michael Novitsky, Moran Barenboim, and Vadim Indelman. Previous knowledge utilization in online anytime belief space planning. arXiv preprint arXiv:2412.13128, 2024

work page arXiv 2024
[15]

Optimally combining sampling techniques for monte carlo rendering

Eric Veach and Leonidas J Guibas. Optimally combining sampling techniques for monte carlo rendering. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 419–428. ACM, 1995

work page 1995
[16]

Farhi and V

E. Farhi and V . Indelman. ix-bsp: Incremental belief space planning. arXiv preprint arXiv:2102.09539, 2021

work page arXiv 2021
[17]

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018

work page 2018
[18]

Moss, Anthony Corso, Jef Caers, and Mykel J

Robert J. Moss, Anthony Corso, Jef Caers, and Mykel J. Kochenderfer. BetaZero: Belief- State Planning for Long-Horizon POMDPs using Learned Approximations. In Reinforcement Learning Conference (RLC), 2024

work page 2024
[19]

Sutton and A.G

R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction . The MIT press, Cambridge, MA, 1998

work page 1998
[20]

L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1):99–134, 1998

work page 1998
[21]

Doucet, N

A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods In Practice. Springer-Verlag, New York, 2001

work page 2001
[22]

Thrun, W

S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. The MIT press, Cambridge, MA, 2005

work page 2005
[23]

A. Kong, J. S. Liu, and W. H. Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288, 1994

work page 1994
[24]

A survey of monte carlo tree search methods

Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012

work page 2012
[25]

Bandit based monte-carlo planning

Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006

work page 2006
[26]

Real analysis: modern techniques and their applications

Gerald B Folland. Real analysis: modern techniques and their applications . John Wiley & Sons, 1999

work page 1999
[27]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2015
[28]

Sample distribution theory using coarea formula

Luigi Negro. Sample distribution theory using coarea formula. Communications in Statistics- Theory and Methods, pages 1–26, 2022

work page 2022
[29]

Scalable differentiable physics for learning and control

Yi-Ling Qiao, Junbang Liang, Vladlen Koltun, and Ming Lin. Scalable differentiable physics for learning and control. In International Conference on Machine Learning, pages 7847–7856. PMLR, 2020. xi

work page 2020
[30]

Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem

C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax - a differentiable physics engine for large scale rigid body simulation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https://openreview.net/forum?id=VdvDlnnjzIN

work page 2021
[31]

Dojo: A differentiable physics engine for robotics, 2022

Taylor Howell, Simon Le Cleac’h, Jan Bruedigam, Zico Kolter, Mac Schwager, and Zachary Manchester. Dojo: A differentiable physics engine for robotics, 2022. URL https://arxiv. org/abs/2203.00806

work page arXiv 2022
[32]

Pomdps.jl: A framework for sequential decision making under uncertainty

Maxim Egorov, Zachary N Sunberg, Edward Balaban, Tim A Wheeler, Jayesh K Gupta, and Mykel J Kochenderfer. Pomdps.jl: A framework for sequential decision making under uncertainty. The Journal of Machine Learning Research, 18(1):831–835, 2017

work page 2017
[33]

Instead of rewriting foreign code for machine learning, automatically synthesize fast gradients

William Moses and Valentin Churavy. Instead of rewriting foreign code for machine learning, automatically synthesize fast gradients. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12472–12485. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/...

work page 2020
[34]

Forward-Mode Automatic Differentiation in Julia

J. Revels, M. Lubin, and T. Papamarkou. Forward-mode automatic differentiation in Julia, 2016. URL https://arxiv.org/abs/1607.07892

work page internal anchor Pith review Pith/arXiv arXiv 2016
[35]

Universal differential equations for scientific machine learning, 2020

Christopher Rackauckas, Yingbo Ma, Julius Martensen, Collin Warner, Kirill Zubov, Rohit Supekar, Dominic Skinner, and Ali Ramadhan. Universal differential equations for scientific machine learning, 2020

work page 2020
[36]

Reinforcement learning with replacing eligibility traces

Satinder P Singh and Richard S Sutton. Reinforcement learning with replacing eligibility traces. Machine learning, 22:123–158, 1996

work page 1996
[37]

Tree-based batch mode reinforcement learning

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 2005

work page 2005
[38]

Differentiation under the integral sign with weak derivatives, 2006

Steve Cheng. Differentiation under the integral sign with weak derivatives, 2006. URL https://planetmath.org/differentiationundertheintegralsign. [Online]

work page 2006
[39]

Optimality guarantees for particle belief approximation of pomdps

Michael H Lim, Tyler J Becker, Mykel J Kochenderfer, Claire J Tomlin, and Zachary N Sunberg. Optimality guarantees for particle belief approximation of pomdps. Journal of Artificial Intelligence Research, 77:1591–1636, 2023

work page 2023
[40]

The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning

Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning. Springer Science & Business Media, 2004

work page 2004
[41]

well-behaved

Zdravko I Botev, Dirk P Kroese, Reuven Y Rubinstein, and Pierre L’ecuyer. The cross-entropy method for optimization. In Handbook of statistics, volume 31, pages 35–59. Elsevier, 2013. xii A Proofs and Derivations A.1 Theorem 1 We first highlight that the required "well-behaved" conditions for the theorem are for changing the order of derivatives and integ...

work page 2013
[42]

If limt→t0 f(x, t) = f(x, t0) for every x, then limt→t0 F (t) = F (t0); in particular, if f(x, ·) is continuous for each x, then F is continuous

Suppose that there exists g ∈ L1(µ) such that |f(x, t)| ≤ g(x) for all x, t. If limt→t0 f(x, t) = f(x, t0) for every x, then limt→t0 F (t) = F (t0); in particular, if f(x, ·) is continuous for each x, then F is continuous

work page
[43]

given tree structure

Suppose that ∂f /∂t exists and there is a g ∈ L1(µ) such that |∂f /∂t(x, t)| ≤ g(x) for all x, t. Then F is differentiable and F ′ = R (∂f /∂t)(x, t) dµ(x). More general versions of the theorem have been proven [38]. As this is not the focus of our work, we leave the formalization of Theorem 1 general. We note that Lebesgue integrability may hold for a ve...

work page
[44]

MDP: n ≜ (n(s′,i)+1)M i=1, (102) PB-MDP: n ≜ (n(¯b′,i)+1)M i=1

Child state visitation counters. MDP: n ≜ (n(s′,i)+1)M i=1, (102) PB-MDP: n ≜ (n(¯b′,i)+1)M i=1. (103)

work page
[45]

MDP: p = (pT (s′,i | s, a))M i=1, (104) PB-MDP: p = (pT (¯b−,i | ¯b, a))M i=1

Target transition likelihoods. MDP: p = (pT (s′,i | s, a))M i=1, (104) PB-MDP: p = (pT (¯b−,i | ¯b, a))M i=1. (105)

work page
[46]

MDP: q = (pT (s′,i | s, aprop(s′,i)))M i=1, (106) PB-MDP: q = (pT (¯b−,i | ¯b, aprop(¯b−,i)))M i=1

Proposal transition likelihoods. MDP: q = (pT (s′,i | s, aprop(s′,i)))M i=1, (106) PB-MDP: q = (pT (¯b−,i | ¯b, aprop(¯b−,i)))M i=1. (107)

work page
[47]

MDP: r = (r(s, a, s′,i))M i=1 (108) PB-MDP: r = (r(¯b, a, ¯b′,i))M i=1

Immediate rewards. MDP: r = (r(s, a, s′,i))M i=1 (108) PB-MDP: r = (r(¯b, a, ¯b′,i))M i=1. (109)

work page
[48]

Future value estimates. MDP: v = ( ˆV (s′,i))M i=1 (110) PB-MDP: v = ( ˆV (¯b′,i))M i=1 (111) From here on, the PB-MDP case is the same up to different notations as the MDP case, and we will show only the latter. Using these notations the immediate reward and future value estimates can be compactly written: η(s, a) = 1T (n ⊙ p ⊘ q), (112) ˆr(s, a) = η(s, ...

work page
[49]

Log likelihoods l(p) and l(q)

work page
[50]

n′ and previous visitation counters n

work page
[51]

The equations for ˆr(s, a) are analogous to ˆVf(s, a), and we will not show their derivation explicitly

v′ and previous future value estimates v. The equations for ˆr(s, a) are analogous to ˆVf(s, a), and we will not show their derivation explicitly. xx Action Backpropagation. Let s′,j be an updated node. Hence, we updated nj to n′ j and vj to v′ j. We update n(s, a) by (98) and perform η′(s, a) = η(s, a) + pj qj n′ j − nj , (115) ˆV ′ f(s, a) = 1 η′(s, a) ...

work page
[52]

Constant hyperparameters were defined for each algorithm, including simulation budget nsims, particle count J, and rollout policy

work page
[53]

A cross-entropy (CE) optimization was performed over the UCB, DPW and step size (where applicable) for each algorithm

work page
[54]

For the best performing hyperparameters of each algorithm, we ran 1000 simula- tions of the algorithm with the same seeds between algorithms, at the multiples [10−1, 10−0.75, 10−0.5, 10−0.25, 1.0] of the simulation budget nsims. The simulation jobs were run on a distributed cluster of 8 machines with different cores and con- figurations, using the Julia p...

work page
[55]

Ryzen 7 3700X 8-Core (16 threads) Processor

work page
[56]

Intel i7-6820HQ 4-Core (8 threads) Processor

work page
[57]

Intel i7-8550U 4-Core (8 threads) Processor

work page
[58]

xxv Table 1: Hyperparameters optimized by CE in the evaluations of each algorithm in the D-Continuous Light-Dark, Mountain Car, Hill Car and Lunar Lander domains

Intel E3-1241 v3 4-Core (8 threads) Processor. xxv Table 1: Hyperparameters optimized by CE in the evaluations of each algorithm in the D-Continuous Light-Dark, Mountain Car, Hill Car and Lunar Lander domains. Parameter Notation Explanation Algorithms Tuning c UCT exploration bonus All CE ka Action prog. widening factor All CE αa Action prog. widening pow...

work page
[59]

Intel E5-1620 v4 4-Core (8 threads) Processor

work page
[60]

Intel i9-7920X 12-Core (24 threads) Processor

work page
[61]

Intel i9-11900K 8-Core (16 threads) Processor

work page
[62]

Intel i9-12900K 12-Core (24 threads) Processor. C.1.1 Simulation Budget Since POMCPOW is a state-trajectory algorithm, in the POMDP scenarios we measured empirically the runtime of PFT-DPW at nsims and J, and set a larger simulation budget for POMCPOW nsims,POMCPOW to match the runtime of PFT-DPW. C.1.2 Rollout Policy In order to save computation time and...

work page
[63]

UCT exploration bonus c

work page
[64]

Action progressive widening parameters ka and αa

work page
[65]

For AGMCTS, we also optimized over the Adam learning rate ηAdam, and the action update distance threshold T min da only in the Light-Dark domains

Observation progressive widening parameters ko and αo. For AGMCTS, we also optimized over the Adam learning rate ηAdam, and the action update distance threshold T min da only in the Light-Dark domains. xxvi (a) Middle of scenario, t = 5. (b) End of scenario, t = 6. Figure 4: AGMCTS at 2-Continuous Light-Dark. The agent’s current state is the red dot, the ...

work page 2048

[1] [1]

Continuous upper confidence trees

Adrien Couëtoux, Jean-Baptiste Hoock, Nataliya Sokolovska, Olivier Teytaud, and Nicolas Bonnard. Continuous upper confidence trees. In Learning and Intelligent Optimization: 5th International Conference, LION 5, Rome, Italy, January 17-21, 2011. Selected Papers 5, pages 433–445. Springer, 2011

work page 2011

[2] [2]

Online algorithms for pomdps with continuous state, action, and observation spaces

Zachary Sunberg and Mykel Kochenderfer. Online algorithms for pomdps with continuous state, action, and observation spaces. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 28, 2018

work page 2018

[3] [3]

Monte carlo tree search in continuous spaces using voronoi optimistic optimization with regret bounds

Beomjoon Kim, Kyungjae Lee, Sungbin Lim, Leslie Kaelbling, and Tomás Lozano-Pérez. Monte carlo tree search in continuous spaces using voronoi optimistic optimization with regret bounds. In AAAI Conf. on Artificial Intelligence, volume 34, pages 9916–9924, 2020

work page 2020

[4] [4]

V oronoi progressive widening: efficient online solvers for continuous state, action, and observation pomdps

Michael H Lim, Claire J Tomlin, and Zachary N Sunberg. V oronoi progressive widening: efficient online solvers for continuous state, action, and observation pomdps. In2021 60th IEEE conference on decision and control (CDC), pages 4493–4500. IEEE, 2021

work page 2021

[5] [5]

Adaptive discretization using voronoi trees for continuous pomdps

Marcus Hoerger, Hanna Kurniawati, Dirk Kroese, and Nan Ye. Adaptive discretization using voronoi trees for continuous pomdps. The International Journal of Robotics Research, 43(9): 1283–1298, 2024

work page 2024

[6] [6]

Monte carlo tree search in continuous action spaces with execution uncertainty

Timothy Yee, Viliam Lis`y, Michael H Bowling, and S Kambhampati. Monte carlo tree search in continuous action spaces with execution uncertainty. In IJCAI, pages 690–697, 2016

work page 2016

[7] [7]

Continuous state-action-observation pomdps for trajectory planning with bayesian optimisation

Philippe Morere, Roman Marchant, and Fabio Ramos. Continuous state-action-observation pomdps for trajectory planning with bayesian optimisation. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 8779–8786. IEEE, 2018

work page 2018

[8] [8]

Bayesian optimized monte carlo planning

John Mern, Anil Yildiz, Zachary Sunberg, Tapan Mukerji, and Mykel J Kochenderfer. Bayesian optimized monte carlo planning. In Proceedings of the AAAI Conference on Artificial Intelli- gence, volume 35, pages 11880–11887, 2021

work page 2021

[9] [9]

Monte-carlo tree search in continuous action spaces with value gradients

Jongmin Lee, Wonseok Jeon, Geon-Hyeong Kim, and Kee-Eung Kim. Monte-carlo tree search in continuous action spaces with value gradients. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 4561–4568, 2020

work page 2020

[10] [10]

Policy gradient meth- ods for reinforcement learning with function approximation

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient meth- ods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999

work page 1999

[11] [11]

Trust region policy optimization of pomdps

Kamyar Azizzadenesheli, Yisong Yue, and Animashree Anandkumar. Policy gradient in partially observable environments: Approximation and convergence. arXiv preprint arXiv:1810.07900, 2018

work page arXiv 2018

[12] [12]

A policy gradient method for confounded POMDPs

Mao Hong, Zhengling Qi, and Yanxun Xu. A policy gradient method for confounded POMDPs. In The Twelfth International Conference on Learning Representations , 2024. URL https: //openreview.net/forum?id=8BAkNCqpGW. x

work page 2024

[13] [13]

Monte-carlo graph search: the value of merging similar states

Edouard Leurent and Odalric-Ambrym Maillard. Monte-carlo graph search: the value of merging similar states. In Sinno Jialin Pan and Masashi Sugiyama, editors, Asian Conference on Machine Learning (ACML 2020), pages 577 – 592, Bangkok, Thailand, November 18-20 2020

work page 2020

[14] [14]

Previous knowledge utilization in online anytime belief space planning

Michael Novitsky, Moran Barenboim, and Vadim Indelman. Previous knowledge utilization in online anytime belief space planning. arXiv preprint arXiv:2412.13128, 2024

work page arXiv 2024

[15] [15]

Optimally combining sampling techniques for monte carlo rendering

Eric Veach and Leonidas J Guibas. Optimally combining sampling techniques for monte carlo rendering. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 419–428. ACM, 1995

work page 1995

[16] [16]

Farhi and V

E. Farhi and V . Indelman. ix-bsp: Incremental belief space planning. arXiv preprint arXiv:2102.09539, 2021

work page arXiv 2021

[17] [17]

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018

work page 2018

[18] [18]

Moss, Anthony Corso, Jef Caers, and Mykel J

Robert J. Moss, Anthony Corso, Jef Caers, and Mykel J. Kochenderfer. BetaZero: Belief- State Planning for Long-Horizon POMDPs using Learned Approximations. In Reinforcement Learning Conference (RLC), 2024

work page 2024

[19] [19]

Sutton and A.G

R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction . The MIT press, Cambridge, MA, 1998

work page 1998

[20] [20]

L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1):99–134, 1998

work page 1998

[21] [21]

Doucet, N

A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods In Practice. Springer-Verlag, New York, 2001

work page 2001

[22] [22]

Thrun, W

S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. The MIT press, Cambridge, MA, 2005

work page 2005

[23] [23]

A. Kong, J. S. Liu, and W. H. Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288, 1994

work page 1994

[24] [24]

A survey of monte carlo tree search methods

Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012

work page 2012

[25] [25]

Bandit based monte-carlo planning

Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006

work page 2006

[26] [26]

Real analysis: modern techniques and their applications

Gerald B Folland. Real analysis: modern techniques and their applications . John Wiley & Sons, 1999

work page 1999

[27] [27]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2015

[28] [28]

Sample distribution theory using coarea formula

Luigi Negro. Sample distribution theory using coarea formula. Communications in Statistics- Theory and Methods, pages 1–26, 2022

work page 2022

[29] [29]

Scalable differentiable physics for learning and control

Yi-Ling Qiao, Junbang Liang, Vladlen Koltun, and Ming Lin. Scalable differentiable physics for learning and control. In International Conference on Machine Learning, pages 7847–7856. PMLR, 2020. xi

work page 2020

[30] [30]

Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem

C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax - a differentiable physics engine for large scale rigid body simulation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https://openreview.net/forum?id=VdvDlnnjzIN

work page 2021

[31] [31]

Dojo: A differentiable physics engine for robotics, 2022

Taylor Howell, Simon Le Cleac’h, Jan Bruedigam, Zico Kolter, Mac Schwager, and Zachary Manchester. Dojo: A differentiable physics engine for robotics, 2022. URL https://arxiv. org/abs/2203.00806

work page arXiv 2022

[32] [32]

Pomdps.jl: A framework for sequential decision making under uncertainty

Maxim Egorov, Zachary N Sunberg, Edward Balaban, Tim A Wheeler, Jayesh K Gupta, and Mykel J Kochenderfer. Pomdps.jl: A framework for sequential decision making under uncertainty. The Journal of Machine Learning Research, 18(1):831–835, 2017

work page 2017

[33] [33]

Instead of rewriting foreign code for machine learning, automatically synthesize fast gradients

William Moses and Valentin Churavy. Instead of rewriting foreign code for machine learning, automatically synthesize fast gradients. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12472–12485. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/...

work page 2020

[34] [34]

Forward-Mode Automatic Differentiation in Julia

J. Revels, M. Lubin, and T. Papamarkou. Forward-mode automatic differentiation in Julia, 2016. URL https://arxiv.org/abs/1607.07892

work page internal anchor Pith review Pith/arXiv arXiv 2016

[35] [35]

Universal differential equations for scientific machine learning, 2020

Christopher Rackauckas, Yingbo Ma, Julius Martensen, Collin Warner, Kirill Zubov, Rohit Supekar, Dominic Skinner, and Ali Ramadhan. Universal differential equations for scientific machine learning, 2020

work page 2020

[36] [36]

Reinforcement learning with replacing eligibility traces

Satinder P Singh and Richard S Sutton. Reinforcement learning with replacing eligibility traces. Machine learning, 22:123–158, 1996

work page 1996

[37] [37]

Tree-based batch mode reinforcement learning

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 2005

work page 2005

[38] [38]

Differentiation under the integral sign with weak derivatives, 2006

Steve Cheng. Differentiation under the integral sign with weak derivatives, 2006. URL https://planetmath.org/differentiationundertheintegralsign. [Online]

work page 2006

[39] [39]

Optimality guarantees for particle belief approximation of pomdps

Michael H Lim, Tyler J Becker, Mykel J Kochenderfer, Claire J Tomlin, and Zachary N Sunberg. Optimality guarantees for particle belief approximation of pomdps. Journal of Artificial Intelligence Research, 77:1591–1636, 2023

work page 2023

[40] [40]

The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning

Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning. Springer Science & Business Media, 2004

work page 2004

[41] [41]

well-behaved

Zdravko I Botev, Dirk P Kroese, Reuven Y Rubinstein, and Pierre L’ecuyer. The cross-entropy method for optimization. In Handbook of statistics, volume 31, pages 35–59. Elsevier, 2013. xii A Proofs and Derivations A.1 Theorem 1 We first highlight that the required "well-behaved" conditions for the theorem are for changing the order of derivatives and integ...

work page 2013

[42] [42]

If limt→t0 f(x, t) = f(x, t0) for every x, then limt→t0 F (t) = F (t0); in particular, if f(x, ·) is continuous for each x, then F is continuous

Suppose that there exists g ∈ L1(µ) such that |f(x, t)| ≤ g(x) for all x, t. If limt→t0 f(x, t) = f(x, t0) for every x, then limt→t0 F (t) = F (t0); in particular, if f(x, ·) is continuous for each x, then F is continuous

work page

[43] [43]

given tree structure

Suppose that ∂f /∂t exists and there is a g ∈ L1(µ) such that |∂f /∂t(x, t)| ≤ g(x) for all x, t. Then F is differentiable and F ′ = R (∂f /∂t)(x, t) dµ(x). More general versions of the theorem have been proven [38]. As this is not the focus of our work, we leave the formalization of Theorem 1 general. We note that Lebesgue integrability may hold for a ve...

work page

[44] [44]

MDP: n ≜ (n(s′,i)+1)M i=1, (102) PB-MDP: n ≜ (n(¯b′,i)+1)M i=1

Child state visitation counters. MDP: n ≜ (n(s′,i)+1)M i=1, (102) PB-MDP: n ≜ (n(¯b′,i)+1)M i=1. (103)

work page

[45] [45]

MDP: p = (pT (s′,i | s, a))M i=1, (104) PB-MDP: p = (pT (¯b−,i | ¯b, a))M i=1

Target transition likelihoods. MDP: p = (pT (s′,i | s, a))M i=1, (104) PB-MDP: p = (pT (¯b−,i | ¯b, a))M i=1. (105)

work page

[46] [46]

MDP: q = (pT (s′,i | s, aprop(s′,i)))M i=1, (106) PB-MDP: q = (pT (¯b−,i | ¯b, aprop(¯b−,i)))M i=1

Proposal transition likelihoods. MDP: q = (pT (s′,i | s, aprop(s′,i)))M i=1, (106) PB-MDP: q = (pT (¯b−,i | ¯b, aprop(¯b−,i)))M i=1. (107)

work page

[47] [47]

MDP: r = (r(s, a, s′,i))M i=1 (108) PB-MDP: r = (r(¯b, a, ¯b′,i))M i=1

Immediate rewards. MDP: r = (r(s, a, s′,i))M i=1 (108) PB-MDP: r = (r(¯b, a, ¯b′,i))M i=1. (109)

work page

[48] [48]

Future value estimates. MDP: v = ( ˆV (s′,i))M i=1 (110) PB-MDP: v = ( ˆV (¯b′,i))M i=1 (111) From here on, the PB-MDP case is the same up to different notations as the MDP case, and we will show only the latter. Using these notations the immediate reward and future value estimates can be compactly written: η(s, a) = 1T (n ⊙ p ⊘ q), (112) ˆr(s, a) = η(s, ...

work page

[49] [49]

Log likelihoods l(p) and l(q)

work page

[50] [50]

n′ and previous visitation counters n

work page

[51] [51]

The equations for ˆr(s, a) are analogous to ˆVf(s, a), and we will not show their derivation explicitly

v′ and previous future value estimates v. The equations for ˆr(s, a) are analogous to ˆVf(s, a), and we will not show their derivation explicitly. xx Action Backpropagation. Let s′,j be an updated node. Hence, we updated nj to n′ j and vj to v′ j. We update n(s, a) by (98) and perform η′(s, a) = η(s, a) + pj qj n′ j − nj , (115) ˆV ′ f(s, a) = 1 η′(s, a) ...

work page

[52] [52]

Constant hyperparameters were defined for each algorithm, including simulation budget nsims, particle count J, and rollout policy

work page

[53] [53]

A cross-entropy (CE) optimization was performed over the UCB, DPW and step size (where applicable) for each algorithm

work page

[54] [54]

For the best performing hyperparameters of each algorithm, we ran 1000 simula- tions of the algorithm with the same seeds between algorithms, at the multiples [10−1, 10−0.75, 10−0.5, 10−0.25, 1.0] of the simulation budget nsims. The simulation jobs were run on a distributed cluster of 8 machines with different cores and con- figurations, using the Julia p...

work page

[55] [55]

Ryzen 7 3700X 8-Core (16 threads) Processor

work page

[56] [56]

Intel i7-6820HQ 4-Core (8 threads) Processor

work page

[57] [57]

Intel i7-8550U 4-Core (8 threads) Processor

work page

[58] [58]

xxv Table 1: Hyperparameters optimized by CE in the evaluations of each algorithm in the D-Continuous Light-Dark, Mountain Car, Hill Car and Lunar Lander domains

Intel E3-1241 v3 4-Core (8 threads) Processor. xxv Table 1: Hyperparameters optimized by CE in the evaluations of each algorithm in the D-Continuous Light-Dark, Mountain Car, Hill Car and Lunar Lander domains. Parameter Notation Explanation Algorithms Tuning c UCT exploration bonus All CE ka Action prog. widening factor All CE αa Action prog. widening pow...

work page

[59] [59]

Intel E5-1620 v4 4-Core (8 threads) Processor

work page

[60] [60]

Intel i9-7920X 12-Core (24 threads) Processor

work page

[61] [61]

Intel i9-11900K 8-Core (16 threads) Processor

work page

[62] [62]

Intel i9-12900K 12-Core (24 threads) Processor. C.1.1 Simulation Budget Since POMCPOW is a state-trajectory algorithm, in the POMDP scenarios we measured empirically the runtime of PFT-DPW at nsims and J, and set a larger simulation budget for POMCPOW nsims,POMCPOW to match the runtime of PFT-DPW. C.1.2 Rollout Policy In order to save computation time and...

work page

[63] [63]

UCT exploration bonus c

work page

[64] [64]

Action progressive widening parameters ka and αa

work page

[65] [65]

For AGMCTS, we also optimized over the Adam learning rate ηAdam, and the action update distance threshold T min da only in the Light-Dark domains

Observation progressive widening parameters ko and αo. For AGMCTS, we also optimized over the Adam learning rate ηAdam, and the action update distance threshold T min da only in the Light-Dark domains. xxvi (a) Middle of scenario, t = 5. (b) End of scenario, t = 6. Figure 4: AGMCTS at 2-Continuous Light-Dark. The agent’s current state is the red dot, the ...

work page 2048