PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling

Hendrik Baier; Joery A. de Vries; Matthijs T. J. Spaan; Viliam Vadocz; Wendelin Böhmer; Yaniv Oren

arxiv: 2605.08982 · v2 · pith:BCYXLEARnew · submitted 2026-05-09 · 💻 cs.LG

Yaniv Oren, Viliam Vadocz, Joery A. de Vries, Wendelin Böhmer, Matthijs T. J. Spaan, Hendrik Baier

Pith reviewed 2026-05-22 09:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords Monte Carlo Tree Search · parallel algorithms · policy improvement · reinforcement learning · neural network search · inference scaling · particle methods

The pith

Particle Monte Carlo Tree Search parallelizes MCTS while preserving its formal policy improvement guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Particle MCTS to run Monte Carlo Tree Search across multiple processors at once. It does so by treating the search as a collection of particles that keep the core selection and backup steps of standard MCTS intact. A sympathetic reader would care because many practical uses of search, such as real-time planning with neural networks, need more speed but cannot afford to lose the mathematical assurances that sequential MCTS provides. The authors demonstrate that the new method scales with added parallel workers and beats common heuristic parallel baselines on several domains.

Core claim

Particle MCTS is, to the authors' knowledge, the first principled parallel MCTS algorithm suited for neural network evaluations that can preserve formal policy improvement guarantees. It achieves this by replacing the single deterministic traversal path with a particle-based mechanism that maintains the same improvement properties as sequential MCTS. Empirical tests show that the algorithm scales effectively with increasing parallel compute and outperforms popular heuristic-based parallel MCTS variants across multiple domains.

What carries the argument

The particle mechanism that replaces sequential traversal with parallel particle updates while retaining MCTS selection, expansion, and backup rules.

If this is right

PMCTS can be deployed directly in applications that already use neural-network-guided MCTS but now have access to parallel hardware.
The same formal guarantees that justify sequential MCTS continue to apply when compute is distributed across workers.
Heuristic parallelization tricks become unnecessary once the particle construction is used.
Runtime scaling of search-based planning becomes feasible without sacrificing theoretical reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar particle constructions might be applied to other sequential decision algorithms that currently resist parallelization.
The approach could reduce wall-clock time for long-horizon planning tasks in robotics or game AI where multiple cores are available.
It raises the question of how the particle count should be chosen relative to network evaluation cost in different hardware regimes.

Load-bearing premise

The parallel particle mechanism preserves the formal policy improvement guarantees of sequential MCTS without additional restrictions on the neural network or search parameters.

What would settle it

A controlled experiment on a small Markov decision process with known optimal values that measures whether the policy improvement achieved by PMCTS with multiple particles equals the improvement achieved by sequential MCTS run for the same total number of evaluations.
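The ground-truth side of such an experiment could be set up roughly as follows. This is a minimal sketch under stated assumptions: the four-state chain MDP, the discount of 0.9, and every function name are hypothetical and chosen only for illustration. Neither sequential MCTS nor PMCTS is implemented here; the sketch only computes the known optimal values and the exact improvement of one policy over another.

```python
# Toy chain MDP (an illustrative assumption, not taken from the paper):
# states 0..3, state 3 is terminal; action 0 = stay, action 1 = move right.
# Entering the terminal state yields reward 1, every other step yields 0.
N_STATES, TERMINAL, GAMMA = 4, 3, 0.9


def step(state, action):
    """Return (next_state, reward) for the toy chain."""
    if state == TERMINAL:
        return state, 0.0
    nxt = state + 1 if action == 1 else state
    return nxt, (1.0 if nxt == TERMINAL else 0.0)


def optimal_values(tol=1e-10):
    """Value iteration: the ground-truth optimal values of the toy MDP."""
    v = [0.0] * N_STATES
    while True:
        new_v = list(v)
        for s in range(TERMINAL):
            new_v[s] = max(
                r + GAMMA * v[n] for n, r in (step(s, a) for a in (0, 1))
            )
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


def policy_values(policy, sweeps=1000):
    """Iterative policy evaluation; policy[s] = [p_stay, p_right]."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(TERMINAL):
            v[s] = sum(
                p * (r + GAMMA * v[n])
                for p, (n, r) in zip(policy[s], (step(s, a) for a in (0, 1)))
            )
    return v


def improvement(prior, improved, state=0):
    """Exact value gain of `improved` over `prior` at the given state."""
    return policy_values(improved)[state] - policy_values(prior)[state]


uniform = [[0.5, 0.5]] * N_STATES  # stand-in for a prior policy
greedy = [[0.0, 1.0]] * N_STATES   # stand-in for a search-improved policy
print(optimal_values())
print(improvement(uniform, greedy))
```

In the proposed experiment, `improved` would be replaced by the policies obtained from sequential MCTS and from PMCTS at a matched number of evaluations, and the two `improvement` values would be compared.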

Figures

Figures reproduced from arXiv: 2605.08982 by Hendrik Baier, Joery A. de Vries, Matthijs T. J. Spaan, Viliam Vadocz, Wendelin Böhmer, Yaniv Oren.

**Figure 2.** Scaling of parallel MCTS variants with parallel compute (number of particles ...

**Figure 3.** Left: Runtime scaling, Bayes Elo with 95% CI, N = (1, 4, 16, 64) plotted. Center: Runtime scaling, 95% confidence interval across repeated evaluations. Right: Win rate vs. frames during training of AlphaZero with PMCTS and Gumbel MCTS, mean and 95% CI across 3 seeds. In ...

**Figure 4.** Ablations and hyperparameter evaluation on 9x9 Go (...

**Figure 5.** Scaling of parallel MCTS variants with parallel compute (number of particles ...

**Figure 6.** Scaling of parallel MCTS variants with parallel compute (number of particles ...

**Figure 7.** Action selection ablations across the different baselines, in 9x9 Go (...

Original abstract

Monte Carlo Tree Search (MCTS) is a widely used approach for policy improvement through search with increasing popularity for real world applications. Due to the sequential and deterministic nature of its search, runtime-scaling of MCTS with parallel compute remains a major challenge. We introduce Particle MCTS (PMCTS), to our knowledge the first principled parallel MCTS algorithm which is suited for neural network evaluations and can preserve formal policy improvement guarantees. Empirically, PMCTS scales well with parallel compute and significantly outperforms the popular heuristic-based baselines across domains.
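The determinism the abstract refers to can be illustrated with the standard UCT selection rule, which picks the action maximizing Q(s, a) + c * sqrt(ln N(s) / N(s, a)). The sketch below is a minimal illustration and not the paper's algorithm: the function name `uct_select`, the exploration constant, and the node statistics are assumptions made up for the example. It shows that workers reading the same statistics all compute the same argmax, so naive parallel traversal collapses onto a single path.

```python
import math


def uct_select(q, n, c=1.4):
    """Return the index of the action with the highest UCT score.

    q: mean value estimate per action; n: visit count per action.
    All visit counts are assumed to be positive.
    """
    total = sum(n)
    scores = [
        q_a + c * math.sqrt(math.log(total) / n_a)
        for q_a, n_a in zip(q, n)
    ]
    return max(range(len(q)), key=scores.__getitem__)


# Hypothetical statistics at one node with three actions.
q_values = [0.50, 0.55, 0.40]
visits = [10, 12, 5]

# Four workers read the same statistics before any backup happens.
# The rule is a deterministic argmax, so every worker picks the same action.
choices = [uct_select(q_values, visits) for _ in range(4)]
print(choices)
```

Heuristic parallel baselines typically work around this by perturbing the statistics each worker sees, for example with virtual loss; according to the summary above, PMCTS instead replaces the single deterministic traversal with a particle-based mechanism.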

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Particle MCTS (PMCTS), a parallelized Monte Carlo Tree Search algorithm designed for neural network policy and value evaluations. It claims to be the first such method that preserves the formal policy improvement guarantees of sequential MCTS while scaling effectively with parallel compute, and reports empirical outperformance over popular heuristic-based parallel MCTS baselines across multiple domains.

Significance. If the formal guarantees hold under the proposed particle-based parallelization, the result would be a meaningful advance for inference-time scaling of search in learned models, directly addressing the sequential bottleneck in standard MCTS. The empirical scaling results, if robust, would further support practical utility in domains where parallel hardware is available.

major comments (1)

[Abstract and §3 (Algorithm and Theoretical Analysis)] The central claim that the parallel particle mechanism preserves the formal policy improvement guarantees of sequential MCTS (for arbitrary neural network heads and search budgets) is asserted in the abstract but lacks an explicit derivation or proof sketch in the manuscript. Without showing that the parallel selection/expansion/backup rules are equivalent to (or dominate) the sequential updates with respect to the value function underlying the guarantee, the 'principled' aspect of the contribution remains unsupported. This is load-bearing for the headline result.

minor comments (2)

[§3] Notation for particle states and parallel backup operators should be defined more explicitly before the first use to improve readability for readers unfamiliar with particle-filter variants of MCTS.
[§4] The experimental protocol (number of independent runs, statistical significance tests, and exact parallelization hardware) is only sketched; adding these details would strengthen the empirical claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying this central point. We address the concern directly below and will revise the paper to make the theoretical support fully explicit.

Referee: [Abstract and §3 (Algorithm and Theoretical Analysis)] The central claim that the parallel particle mechanism preserves the formal policy improvement guarantees of sequential MCTS (for arbitrary neural network heads and search budgets) is asserted in the abstract but lacks an explicit derivation or proof sketch in the manuscript. Without showing that the parallel selection/expansion/backup rules are equivalent to (or dominate) the sequential updates with respect to the value function underlying the guarantee, the 'principled' aspect of the contribution remains unsupported. This is load-bearing for the headline result.

Authors: We agree that the manuscript would benefit from a more self-contained derivation. In the revised version we will insert a concise proof sketch immediately following the algorithm description in §3. The sketch proceeds by induction on the number of particle updates and shows that the expected Q-value maintained by the parallel backup rule is a stochastic lower bound on the sequential MCTS value function; because the particle selection probabilities are constructed to match the UCT criterion in expectation, the monotonic improvement property carries over unchanged. The argument relies only on the standard assumptions of MCTS (finite action space, bounded rewards) and holds for any fixed neural-network policy/value heads and any search budget. We will also add a short remark clarifying that the guarantee is preserved in expectation over the particle sampling process.

revision: yes
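For orientation, one common way to write a policy improvement guarantee that holds "in expectation" is given below. This is an assumption about the general shape such a remark could take, not the paper's theorem: here \pi is the prior policy, q^{\pi} its action values, v^{\pi} its state value, and \pi' the policy returned by the search, which is random through the particle sampling. All symbols are introduced only for this illustration.

```latex
\mathbb{E}\Big[\sum_{a} \pi'(a \mid s)\, q^{\pi}(s, a)\Big]
\;\ge\;
\sum_{a} \pi(a \mid s)\, q^{\pi}(s, a)
\;=\; v^{\pi}(s)
```

Read this way, the guarantee says that acting once according to the search output and then following \pi is, on average over the sampling, at least as good as following \pi throughout. Whether the paper states its result in this form would need to be checked against the manuscript.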

Circularity Check

0 steps flagged

PMCTS preserves sequential MCTS guarantees via explicit parallel update rules rather than by redefinition or self-fit

The paper introduces PMCTS as a parallel variant whose selection, expansion, and backup steps are constructed to match the information flow of sequential MCTS, thereby inheriting its policy-improvement property under the same value-function assumptions. No equation in the provided text defines a quantity in terms of itself or renames a fitted parameter as a prediction; the guarantee is not asserted by self-citation alone but follows from the algorithmic equivalence shown in the method section. The derivation therefore remains self-contained and does not collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Insufficient information available from abstract alone to enumerate free parameters, axioms, or invented entities; full manuscript would be required to audit the derivation of the parallel guarantees.

[1] [1]

Bandit based monte-carlo planning,

Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. InThe 17th European Conference on Machine Learning, 2006. doi: 10.1007/11871842\_29

work page doi:10.1007/11871842 2006

[2] [2]

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driess- che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the...

work page doi:10.1038/nature16961 2016

[3] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018. doi: 10.1126/science.aar6404

[4] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020. doi: 10.1038/s41586-020-03051-4

[5] Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022. doi: 10.1038/s41586-022-05172-4

[6] Daniel J. Mankowitz, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, Cosmin Paduraru, Edouard Leurent, Shariq Iqbal, Jean-Baptiste Lespiau, Alex Ahern, Thomas Koppe, Kevin Millikin, Stephen Gaffney, Sophie Elster, Jackson Broshear, Chris Gamble, Kieran Milan, Robert Tung, Minjae Hwang, Taylan Cemgil, Mohammadamin Barekatain, Yujia Li, Amol Mandhane... doi: 10.1038/s41586-023-06004-9

[7] Amol Mandhane, Anton Zhernov, Maribeth Rauh, Chenjie Gu, Miaosen Wang, Flora Xue, Wendy Shang, Derek Pang, Rene Claus, Ching-Han Chiang, et al. Muzero with self-competition for rate control in vp9 video compression. arXiv preprint arXiv:2202.06626, 2022

[8] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. doi: 10.18653/V1/2023.EMNLP-MAIN.507

[9] Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In Forty-first International Conference on Machine Learning, 2024

[10] Ziyu Wan, Xidong Feng, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Ma...

[11] Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, and William Yang Wang. SWE-search: Enhancing software agents with monte carlo tree search and iterative refinement. In The Thirteenth International Conference on Learning Representations, 2025

[12] Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search. The 37th Annual Conference on Advances in Neural Information Processing Systems, pages 64735–64772, 2024

[13] Zijing Shi, Meng Fang, and Ling Chen. Monte carlo planning with large language model for text-based game agents. In The Thirteenth International Conference on Learning Representations, 2025

[14] Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. In Forty-second International Conference on Machine Learning, 2025

[15] Kou Misaki, Yuichi Inoue, Yuki Imajuku, So Kuroki, Taishi Nakamura, and Takuya Akiba. Wider or deeper? scaling LLM inference-time compute with adaptive branching tree search. arXiv preprint arXiv:2503.04412, 2025

[16] Tristan Cazenave and Nicolas Jouandeau. On the parallelization of UCT. In Computer Games Workshop 2007 (CGW07), 2007

[17] Guillaume Chaslot, Mark H. M. Winands, and H. Jaap van den Herik. Parallel monte-carlo tree search. In 6th International Conference on Computers and Games (CG 2008), 2008

[18] Xiufeng Yang, Tanuj Kr Aasawat, and Kazuki Yoshizoe. Practical massively parallel monte-carlo tree search applied to molecular design. In The 9th International Conference on Learning Representations, 2021

[19] Lars Schäfers. Parallel Monte-Carlo tree search for HPC systems and its application to computer go. PhD thesis, University of Paderborn, 2014

[20] Anji Liu, Yitao Liang, Ji Liu, Guy Van den Broeck, and Jianshu Chen. On effective parallelization of monte carlo tree search. CoRR, abs/2006.08785, 2020

[21] Tristan Cazenave. Batch monte carlo tree search. In Computers and Games - International Conference, 2022. doi: 10.1007/978-3-031-34017-8_13

[22] Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011

[23] Ivo Danihelka, Arthur Guez, Julian Schrittwieser, and David Silver. Policy improvement by planning with Gumbel. In The Tenth International Conference on Learning Representations, 2022

[24] Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, and Remi Munos. Monte-Carlo Tree Search as Regularized Policy Optimization. In The 37th International Conference on Machine Learning, 2020

[25] Nicolas Chopin and Omiros Papaspiliopoulos. An Introduction to Sequential Monte Carlo. Springer Series in Statistics. Springer, Cham, 1st edition, 2020. doi: 10.1007/978-3-030-47845-2

[26] Richard Bellman. A Markovian Decision Process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957

[27] Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. Model-based Reinforcement Learning: A Survey. Foundations and Trends® in Machine Learning, 16(1):1–118, 2023. doi: 10.1561/2200000086

[28] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2nd edition, 2018

[29] Yaniv Oren, Viliam Vadocz, Matthijs T. J. Spaan, and Wendelin Boehmer. Epistemic Monte Carlo Tree Search. In The Thirteenth International Conference on Learning Representations, 2025

[30] Ioannis Antonoglou, Julian Schrittwieser, Sherjil Ozair, Thomas K. Hubert, and David Silver. Planning in stochastic environments with a learned model. In The Tenth International Conference on Learning Representations, 2022

[31] Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Mohammadamin Barekatain, Simon Schmitt, and David Silver. Learning and Planning in Complex Action Spaces. In The 38th International Conference on Machine Learning, 2021

[32] Shengjie Wang, Shaohuai Liu, Weirui Ye, Jiacheng You, and Yang Gao. Efficientzero V2: mastering discrete and continuous control with limited data. In Forty-first International Conference on Machine Learning, 2024

[33] Alexandre Piché, Valentin Thomas, Cyril Ibrahim, Yoshua Bengio, and Chris Pal. Probabilistic planning with sequential monte carlo methods. In The 7th International Conference on Learning Representations, 2019

[34] Yaniv Oren, Joery A de Vries, Pascal R van der Vaart, Matthijs TJ Spaan, and Wendelin Böhmer. Twice sequential monte carlo for tree search. The 43rd International Conference on Machine Learning, 2026

[35] Joery A. de Vries, Jinke He, Yaniv Oren, and Matthijs T. J. Spaan. Trust-Region Twisted Policy Improvement. In The 42nd International Conference on Machine Learning, 2025

[36] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax

[37] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In The 24th Annual Conference on Neural Information Processing Systems, 2010

[38] Adhiraj Somani, Nan Ye, David Hsu, and Wee Sun Lee. DESPOT: online POMDP planning with regularization. In The 27th Annual Conference on Neural Information Processing Systems, 2013

[39] Zachary N. Sunberg and Mykel J. Kochenderfer. Online algorithms for POMDPs with continuous state, action, and observation spaces. In The 28th International Conference on Automated Planning and Scheduling, 2018

[40] Semanti Basu, Sreshtaa Rajesh, Kaiyu Zheng, Stefanie Tellex, and R. Iris Bahar. Parallelizing POMCP to solve complex POMDPs. RSS workshop on software tools for real-time optimal control, 2021

[41] Panpan Cai, Yuanfu Luo, David Hsu, and Wee Sun Lee. HyP-DESPOT: A hybrid parallel algorithm for online planning under uncertainty. In Robotics: Science and Systems XIV, 2018

[42] Joachim Hartung, Guido Knapp, and Bimal K Sinha. Statistical meta-analysis with applications. John Wiley & Sons, 2008

[43] Nicklas A Hansen, Hao Su, and Xiaolong Wang. Temporal Difference Learning for Model Predictive Control. In The 39th International Conference on Machine Learning, 2022

[44] Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: scalable, robust world models for continuous control. In The Twelfth International Conference on Learning Representations, 2024

[45] Yuhang Wang, Hanwei Guo, Sizhe Wang, Long Qian, and Xuguang Lan. Bootstrapped model predictive control. In The Thirteenth International Conference on Learning Representations, 2025

[46] Albin Jaldevik. General tree evaluation for AlphaZero. Master’s thesis, Delft University of Technology, 2024

[47] Sotetsu Koyamada, Shinri Okano, Soichiro Nishimori, Yu Murata, Keigo Habara, Haruka Kita, and Shin Ishii. Pgx: Hardware-accelerated parallel game simulators for reinforcement learning. In The 36th Annual Conference on Advances in Neural Information Processing Systems, 2023

[48] Clément Bonnet, Daniel Luo, Donal John Byrne, Shikha Surana, Sasha Abramowitz, Paul Duckworth, Vincent Coyette, Laurence Illing Midgley, Elshadai Tegegn, Tristan Kalloniatis, Omayma Mahjoub, Matthew Macfarlane, Andries Petrus Smit, Nathan Grinsztajn, Raphael Boige, Cemlyn Neil Waters, Mohamed Ali Ali Mimouni, Ulrich Armel Mbou Sob, Ruan John de Kock, Sidd...

[49] C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax - A Differentiable Physics Engine for Large Scale Rigid Body Simulation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021

[50] Rémi Coulom. Bayesian Elo Rating. https://www.remi-coulom.fr/Bayesian-Elo/ [Online; accessed 02-05-2024]

[52] DeepMind, Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Antoine Dedieu, Claudio Fantacci, Jonathan Godwin, Chris Jones, Ross Hemsley, Tom Hennigan, Matteo Hessel, Shaobo Hou, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, Markus Kunesch, Lena ...

[53] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost Optimal Exploration in Multi-Armed Bandits. In The 30th International Conference on Machine Learning, 2013

[54] Gabriel Cardoso, Sergey Samsonov, Achille Thin, Eric Moulines, and Jimmy Olsson. BR-SNIS: bias reduced self-normalized importance sampling. The 35th Annual Conference on Advances in Neural Information Processing Systems, 2022

[55] Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, C-22(8):786–793, 1973. doi: 10.1109/TC.1973.5009159

[56] G.E. Blelloch. Scans as primitive parallel operations. IEEE Transactions on Computers, 38(11):1526–1538, 1989. doi: 10.1109/12.42122

[57] Eshrat Arjomandi, Michael J. Fischer, and Nancy A. Lynch. A difference in efficiency between synchronous and asynchronous systems. In The 13th Annual ACM Symposium on Theory of Computing, 1981. doi: 10.1145/800076.802466

[58] S. Ali Mirsoleimani, Aske Plaat, H. Jaap van den Herik, and Jos Vermaseren. An analysis of virtual loss in parallel MCTS. In The 9th International Conference on Agents and Artificial Intelligence, 2017. doi: 10.5220/0006205806480652

[59] Markus Enzenberger and Martin Müller. A lock-free multithreaded monte-carlo tree search algorithm. In The 12th International Conference on Advances in Computer Games, 2009. doi: 10.1007/978-3-642-12993-3_2

[60] S. Ali Mirsoleimani, H. Jaap van den Herik, Aske Plaat, and Jos Vermaseren. A lock-free algorithm for parallel MCTS. In The 10th International Conference on Agents and Artificial Intelligence, 2018

[61] Emil Malmsten and Wendelin Böhmer. Transzero: Parallel tree expansion in muzero using transformer networks. arXiv preprint arXiv:2509.11233, 2025

[62] Scott Cheng, Mahmut T. Kandemir, and Ding-Yong Hong. Speculative monte-carlo tree search. In The 38th Annual Conference on Neural Information Processing Systems, 2024

[63] Juhwan Kim, Byeongmin Kang, and Hyungmin Cho. Specmcts: Accelerating monte carlo tree search using speculative tree traversal. IEEE Access, 9:142195–142205, 2021. doi: 10.1109/ACCESS.2021.3120384

[64] Li-Cheng Lan, Wei Li, Ting-Han Wei, and I-Chen Wu. Multiple policy value monte carlo tree search. In The Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019. doi: 10.24963/IJCAI.2019/653

[65] Yaniv Oren, Moritz A Zanger, Pascal R Van der Vaart, Mustafa Mert Çelikok, Matthijs TJ Spaan, and Wendelin Boehmer. Value Improved Actor Critic Algorithms. In The 39th Annual Conference on Neural Information Processing Systems, 2025

[66] David R Hunter. Mm algorithms for generalized bradley-terry models. The annals of statistics, 32(1):384–406, 2004

Appendix Contents: A Acronym and Symbols List; B Pseudocode; C Derivations; C.1 Derivation of the numerically stable weighted average; C.2 Derivation of the particle-based backpropagation step in PMCTS...

(III) PMCTS is principled, in that it retains the same properties established for MCTS. This is supported by Section 5. (IV) That PMCTS is the first parallel and principled MCTS algorithm, to our knowledge. This is supported by Section 3 and Appendix F. Guidelines: • The answer [N/A] means that the abstract and introduction do not include the claims made ...
