PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling
Pith reviewed 2026-05-22 09:50 UTC · model grok-4.3
The pith
Particle Monte Carlo Tree Search parallelizes MCTS while preserving its formal policy improvement guarantees.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Particle MCTS is, to the authors' knowledge, the first principled parallel MCTS algorithm suited to neural network evaluations that preserves formal policy improvement guarantees. It achieves this by replacing the single deterministic traversal path with a particle-based mechanism that maintains the same improvement properties as sequential MCTS. Empirical tests show that the algorithm scales effectively with increasing parallel compute and outperforms popular heuristic-based parallel MCTS variants across multiple domains.
What carries the argument
The particle mechanism that replaces sequential traversal with parallel particle updates while retaining MCTS selection, expansion, and backup rules.
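To make the sequential baseline concrete, here is a minimal Python sketch of one standard form of neural-network-guided MCTS (AlphaZero-style PUCT selection, expansion with one network query, and backup) on a toy two-step tree. The reward table, the `evaluate` stand-in for a policy/value network, and `c_puct = 1.25` are hypothetical choices for illustration. This is the sequential procedure that PMCTS is described as parallelizing, not PMCTS itself.

```python
import math

# Hypothetical toy problem: states are tuples of actions taken so far;
# depth-2 leaves carry fixed rewards, and (1, 0) is the best leaf.
REWARDS = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.2}
ACTIONS = (0, 1)
DEPTH = 2


def evaluate(state):
    """Stand-in for a policy/value network (hypothetical).

    Returns (priors, value): uniform priors and a flat 0.5 value estimate
    at internal states, and (None, reward) at terminal states.
    """
    if len(state) == DEPTH:
        return None, REWARDS[state]
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}, 0.5


class Node:
    def __init__(self, prior):
        self.prior = prior
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0


def select_child(node, c_puct=1.25):
    """PUCT: Q + c * P * sqrt(sum of sibling visits) / (1 + N)."""
    total = sum(ch.visits for ch in node.children.values())

    def score(item):
        _, child = item
        return child.q() + c_puct * child.prior * math.sqrt(total) / (1 + child.visits)

    return max(node.children.items(), key=score)


def simulate(root):
    # 1) Selection: follow the deterministic PUCT argmax down to a leaf.
    node, state, path = root, (), [root]
    while node.children:
        action, node = select_child(node)
        state += (action,)
        path.append(node)
    # 2) Expansion + evaluation: one network query per simulation.
    priors, value = evaluate(state)
    if priors is not None:
        node.children = {a: Node(p) for a, p in priors.items()}
    # 3) Backup: add the leaf value to every node on the path (undiscounted).
    for visited in path:
        visited.visits += 1
        visited.value_sum += value


root = Node(prior=1.0)
for _ in range(50):  # each iteration reads statistics left by the previous one
    simulate(root)
visit_counts = {a: child.visits for a, child in root.children.items()}
print(visit_counts)
```

Each call to `simulate` reads the visit counts and values written by the previous call, which is the dependency that makes straightforward parallelization difficult.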
If this is right
- PMCTS can be deployed directly in applications that already use neural-network-guided MCTS but now have access to parallel hardware.
- The same formal guarantees that justify sequential MCTS continue to apply when compute is distributed across workers.
- Heuristic parallelization tricks become unnecessary once the particle construction is used.
- Runtime scaling of search-based planning becomes feasible without sacrificing theoretical reliability.
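A widely used example of such a heuristic is virtual loss: a simulation that is still in flight temporarily penalizes the child it chose, so that concurrent workers are pushed toward different children. The Python sketch below applies it at a single root node. The penalty size `vl`, its per-visit value `vl_value = -1` (values assumed to lie in [-1, 1]), `c_puct = 1.25`, and the toy statistics are illustrative assumptions; the sketch is not taken from the paper or its baselines.

```python
import math


def puct(q, n, prior, total, c_puct=1.25):
    """PUCT score of one child given the summed visit count of its siblings."""
    return q + c_puct * prior * math.sqrt(total) / (1 + n)


def assign_workers(stats, workers, vl=1, vl_value=-1.0, c_puct=1.25):
    """Pick a root child for each of `workers` in-flight simulations.

    stats: list of dicts with 'n' (visits), 'w' (value sum), 'p' (prior).
    Every pending simulation on a child adds `vl` fake visits with value
    `vl_value` each, lowering that child's Q until the real result arrives.
    With vl=0 this reduces to plain PUCT and every worker picks the same child.
    """
    pending = [0] * len(stats)
    choices = []
    for _ in range(workers):
        total = sum(s["n"] + vl * k for s, k in zip(stats, pending))
        scores = []
        for s, k in zip(stats, pending):
            n_eff = s["n"] + vl * k
            w_eff = s["w"] + vl_value * vl * k
            q_eff = w_eff / n_eff if n_eff else 0.0
            scores.append(puct(q_eff, n_eff, s["p"], total, c_puct))
        best = max(range(len(stats)), key=scores.__getitem__)
        pending[best] += 1
        choices.append(best)
    return choices


# Toy root statistics (made-up numbers): Q = w / n = 0.2, 0.5, 0.1.
stats = [
    {"n": 3, "w": 0.6, "p": 0.3},
    {"n": 5, "w": 2.5, "p": 0.5},
    {"n": 1, "w": 0.1, "p": 0.2},
]
print(assign_workers(stats, 4, vl=0))  # -> [1, 1, 1, 1]  (plain PUCT: all collide)
print(assign_workers(stats, 4))        # -> [1, 1, 2, 0]  (virtual loss spreads them)
```

The penalty changes which leaves get evaluated without any accompanying guarantee about the resulting policy, which is the kind of heuristic the particle construction is claimed to make unnecessary.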
Where Pith is reading between the lines
- Similar particle constructions might be applied to other sequential decision algorithms that currently resist parallelization.
- The approach could reduce wall-clock time for long-horizon planning tasks in robotics or game AI where multiple cores are available.
- It raises the question of how the particle count should be chosen relative to network evaluation cost in different hardware regimes.
Load-bearing premise
The parallel particle mechanism preserves the formal policy improvement guarantees of sequential MCTS without additional restrictions on the neural network or search parameters.
What would settle it
A controlled experiment on a small Markov decision process with known optimal values that measures whether the policy improvement achieved by PMCTS with multiple particles equals the improvement achieved by sequential MCTS run for the same total number of evaluations.
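Such a check needs an MDP whose values are known exactly and a way to measure policy improvement. Below is a minimal Python sketch under stated assumptions: the two-state MDP, the discount `GAMMA = 0.9`, and all function names are hypothetical, and a single greedy improvement step stands in for the search procedure. In the actual experiment, the `improved` argument would be the policy produced by PMCTS with several particles or by sequential MCTS with the same total number of evaluations; neither planner is implemented here.

```python
# Hypothetical 2-state, 2-action MDP with known dynamics:
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9


def backup(v, s, a):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a])


def evaluate_policy(policy, tol=1e-10):
    """Iterative policy evaluation; policy[s] maps action -> probability."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new = sum(pa * backup(v, s, a) for a, pa in policy[s].items())
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v


def value_iteration(tol=1e-10):
    """Known optimal values V* for reference."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new = max(backup(v, s, a) for a in P[s])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v


def greedy(v):
    """Deterministic policy that is greedy with respect to v."""
    return {s: {max(P[s], key=lambda a: backup(v, s, a)): 1.0} for s in P}


def improvement(base, improved):
    """Per-state policy improvement V^improved(s) - V^base(s)."""
    vb, vi = evaluate_policy(base), evaluate_policy(improved)
    return {s: vi[s] - vb[s] for s in P}


uniform = {s: {a: 0.5 for a in P[s]} for s in P}
# Stand-in for a search procedure: one step of greedy policy improvement.
print(improvement(uniform, greedy(evaluate_policy(uniform))))
print(value_iteration())
```

Comparing the improvement vectors of the two planners at matched budgets, and their gap to `value_iteration()`, across several particle counts would give the equality test described above.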
Original abstract
Monte Carlo Tree Search (MCTS) is a widely used approach for policy improvement through search, with increasing popularity for real-world applications. Due to the sequential and deterministic nature of its search, runtime scaling of MCTS with parallel compute remains a major challenge. We introduce Particle MCTS (PMCTS), to our knowledge the first principled parallel MCTS algorithm that is suited for neural network evaluations and can preserve formal policy improvement guarantees. Empirically, PMCTS scales well with parallel compute and significantly outperforms the popular heuristic-based baselines across domains.
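The "deterministic" part of this argument can be seen in a few lines. In the minimal Python sketch below (made-up root statistics, a hypothetical `c_puct = 1.25`, and prior-proportional sampling chosen only for illustration; none of this is PMCTS's actual selection rule), eight workers that apply the standard PUCT argmax to the same, not-yet-updated statistics all pick the same child, while workers that sample a child can spread out.

```python
import math
import random


def puct_scores(q, n, prior, c_puct=1.25):
    """Standard PUCT score for each child: Q + c * P * sqrt(sum N) / (1 + N)."""
    total = sum(n)
    return [qa + c_puct * pa * math.sqrt(total) / (1 + na)
            for qa, na, pa in zip(q, n, prior)]


# Made-up statistics at a root node with three children.
q, n, prior = [0.2, 0.5, 0.1], [3, 5, 1], [0.3, 0.5, 0.2]
scores = puct_scores(q, n, prior)
workers = 8

# Deterministic selection: every worker that reads the same statistics
# before any backup lands on the same child, so the batch is redundant.
deterministic = [max(range(3), key=scores.__getitem__) for _ in range(workers)]

# Stochastic selection (here, sampling proportional to the prior, purely
# for illustration) lets the workers diverge.
rng = random.Random(0)
sampled = rng.choices(range(3), weights=prior, k=workers)
print(deterministic)  # -> [1, 1, 1, 1, 1, 1, 1, 1]
print(sampled)
```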
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Particle MCTS (PMCTS), a parallelized Monte Carlo Tree Search algorithm designed for neural network policy and value evaluations. It claims to be the first such method that preserves the formal policy improvement guarantees of sequential MCTS while scaling effectively with parallel compute, and reports empirical outperformance over popular heuristic-based parallel MCTS baselines across multiple domains.
Significance. If the formal guarantees hold under the proposed particle-based parallelization, the result would be a meaningful advance for inference-time scaling of search in learned models, directly addressing the sequential bottleneck in standard MCTS. The empirical scaling results, if robust, would further support practical utility in domains where parallel hardware is available.
major comments (1)
- [Abstract and §3 (Algorithm and Theoretical Analysis)] The central claim that the parallel particle mechanism preserves formal policy improvement guarantees of sequential MCTS (for arbitrary neural network heads and search budgets) is asserted in the abstract but lacks an explicit derivation or proof sketch in the manuscript. Without showing that the parallel selection/expansion/backup rules are equivalent to (or dominate) the sequential updates with respect to the value function underlying the guarantee, the 'principled' aspect of the contribution remains unsupported. This is load-bearing for the headline result.
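Part of the equivalence the referee asks for is easy and part is not. The easy part concerns the backup alone: for a fixed set of leaf returns, folding them into a node one at a time with the running-mean update gives exactly the same Q as folding them in as a batch, in any order. The hard part, which this sketch does not address, is that in sequential MCTS each return depends on a selection made after the previous backup, so returns gathered by parallel particles need not have the same distribution. Minimal Python sketch with hypothetical helper names; exact `Fraction` arithmetic keeps the comparison free of rounding.

```python
from fractions import Fraction


def sequential_backup(q, n, returns):
    """Running-mean update applied one return at a time."""
    for g in returns:
        n += 1
        q += (g - q) / n
    return q, n


def batched_backup(q, n, returns):
    """All returns folded in at once, as a parallel backup might do."""
    k = len(returns)
    return (q * n + sum(returns)) / (n + k), n + k


q0, n0 = Fraction(1, 2), 4
returns = [Fraction(1), Fraction(0), Fraction(3, 4)]
print(sequential_backup(q0, n0, returns))  # -> (Fraction(15, 28), 7)
print(batched_backup(q0, n0, returns))     # -> (Fraction(15, 28), 7)
```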
minor comments (2)
- [§3] Notation for particle states and parallel backup operators should be defined more explicitly before the first use to improve readability for readers unfamiliar with particle-filter variants of MCTS.
- [§4] The experimental protocol (number of independent runs, statistical significance tests, and exact parallelization hardware) is only sketched; adding these details would strengthen the empirical claims.
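The reporting the referee asks for takes little code. The sketch below (Python standard library only) computes a mean and a percentile-bootstrap 95% interval over independent seeds. The bootstrap is one reasonable choice among several (a t-interval or Welch's test would also do), the function name is hypothetical, and the `runs` values are made-up placeholders, not results from the paper.

```python
import random
import statistics


def bootstrap_ci(samples, n_boot=10_000, alpha=0.05, seed=0):
    """Mean and percentile-bootstrap (1 - alpha) interval over independent runs."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples))) for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(samples), (lo, hi)


# Placeholder per-seed scores (made up, not results from the paper).
runs = [0.62, 0.58, 0.66, 0.61, 0.59, 0.64, 0.60, 0.63]
mean, (lo, hi) = bootstrap_ci(runs)
print(f"mean {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}] over {len(runs)} seeds")
```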
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for identifying this central point. We address the concern directly below and will revise the paper to make the theoretical support fully explicit.
Point-by-point responses
- Referee: [Abstract and §3 (Algorithm and Theoretical Analysis)] The major comment above: the claim that the parallel particle mechanism preserves the formal policy improvement guarantees of sequential MCTS lacks an explicit derivation or proof sketch, so the 'principled' aspect of the contribution remains unsupported.
Authors: We agree that the manuscript would benefit from a more self-contained derivation. In the revised version we will insert a concise proof sketch immediately following the algorithm description in §3. The sketch proceeds by induction on the number of particle updates and shows that the expected Q-value maintained by the parallel backup rule is a stochastic lower bound on the sequential MCTS value function; because the particle selection probabilities are constructed to match the UCT criterion in expectation, the monotonic improvement property carries over unchanged. The argument relies only on the standard assumptions of MCTS (finite action space, bounded rewards) and holds for any fixed neural-network policy/value heads and any search budget. We will also add a short remark clarifying that the guarantee is preserved in expectation over the particle sampling process. revision: yes
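For reference, the UCT criterion that the rebuttal appeals to is UCB1 applied at each node (Kocsis and Szepesvári, 2006): choose the action maximizing Q(s,a) + c·sqrt(ln N(s) / N(s,a)). Below is a minimal Python sketch; c = sqrt(2) is the textbook constant for rewards in [0, 1], and neural-network variants such as AlphaZero use the prior-weighted PUCT rule instead. How PMCTS turns this criterion into particle selection probabilities that match it "in expectation" is not specified in this review, so the sketch covers only the sequential rule.

```python
import math


def uct_score(q, n_child, n_parent, c=math.sqrt(2)):
    """UCB1 applied at a tree node (UCT): mean value plus an exploration bonus."""
    if n_child == 0:
        return math.inf  # untried actions are tried first
    return q + c * math.sqrt(math.log(n_parent) / n_child)


def uct_select(qs, ns):
    """Index of the child maximizing the UCT score at one node."""
    n_parent = sum(ns)
    scores = [uct_score(q, n, n_parent) for q, n in zip(qs, ns)]
    return max(range(len(qs)), key=scores.__getitem__)


print(uct_select([0.5, 0.4], [10, 0]))   # -> 1 (untried action)
print(uct_select([0.9, 0.1], [10, 10]))  # -> 0 (equal visits, higher Q)
print(uct_select([0.6, 0.5], [100, 1]))  # -> 1 (large bonus for the rarely tried action)
```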
Circularity Check
PMCTS preserves sequential MCTS guarantees via explicit parallel update rules rather than by redefinition or self-fit
Full rationale
The paper introduces PMCTS as a parallel variant whose selection, expansion, and backup steps are constructed to match the information flow of sequential MCTS, thereby inheriting its policy-improvement property under the same value-function assumptions. No equation in the provided text defines a quantity in terms of itself or renames a fitted parameter as a prediction; the guarantee is not asserted by self-citation alone but follows from the algorithmic equivalence shown in the method section. The derivation therefore remains self-contained and does not collapse to its own inputs.