Twice Sequential Monte Carlo for Tree Search
Pith reviewed 2026-05-22 12:58 UTC · model grok-4.3
The pith
Running Sequential Monte Carlo twice in sequence reduces variance and path degeneracy in tree search for reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TSMCTS applies Sequential Monte Carlo twice in sequence as a tree-search operator; the second pass operates on the particles produced by the first, which the authors show lowers estimator variance, lessens path degeneracy, and yields stronger policy improvement than either single-pass SMC or a popular MCTS implementation, all while retaining SMC's suitability for parallel hardware.
What carries the argument
Twice Sequential Monte Carlo Tree Search (TSMCTS), a two-stage particle-filter procedure in which the second SMC run refines the weighted particles from the first run to produce the final action-value estimates.
If this is right
- TSMCTS serves as a stronger policy-improvement operator than either SMC or modern MCTS in both discrete and continuous domains.
- The method scales favorably when more sequential compute is allocated to deeper searches.
- Estimator variance drops and the impact of path degeneracy is reduced while parallelization properties of SMC are kept intact.
Where Pith is reading between the lines
- Stacking additional SMC stages might be explored as a general pattern for combating degeneracy in other particle-based planning algorithms.
- The same double-pass idea could be tested as a drop-in replacement inside existing model-based RL pipelines that already use SMC.
- If the variance reduction holds, TSMCTS might allow reliable search at depths where single SMC previously became unusable.
Load-bearing premise
That a second SMC pass will reliably cut variance and path degeneracy enough to improve policy quality without adding prohibitive extra cost or new biases in the tested environments.
What would settle it
In controlled experiments with increased search depth, if TSMCTS shows no reduction in estimator variance or no gain in policy performance over single SMC, the central claim would be falsified.
Figures
read the original abstract
Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS as a policy improvement operator, scales favorably with sequential compute, reduces estimator variance and mitigates the effects of path degeneracy while retaining the properties that make SMC natural to parallelize.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Twice Sequential Monte Carlo Tree Search (TSMCTS), which applies Sequential Monte Carlo twice within a tree-search framework for model-based reinforcement learning. The central claim is that this double-pass construction outperforms a standard SMC baseline and a modern MCTS variant as a policy-improvement operator across discrete and continuous environments, scales favorably with sequential compute, reduces estimator variance, and mitigates path degeneracy while preserving SMC’s parallelizability.
Significance. If the empirical claims hold, the work is significant because it directly targets the scaling limitations that have hindered SMC adoption relative to MCTS in deep search settings. Retaining GPU-friendly parallelism while demonstrating improved policy improvement and variance reduction could influence practical implementations of search-based RL. The evaluation across environment classes and the focus on sequential-compute scaling are strengths that merit attention.
major comments (3)
- [Section 4.2] Section 4.2 and Algorithm 2: the integration of the second SMC pass is described at a high level but does not specify whether the second pass reuses particles from the first pass, resamples independently, or conditions proposals on shared prefixes; without this detail it is impossible to verify that path degeneracy along common tree branches is actually mitigated rather than merely shifted.
- [Experimental results] Experimental results, Figure 4 and Table 2: variance-reduction and performance gains are reported, yet the text provides no information on the number of independent runs, random seeds, or statistical tests used to establish significance; this leaves open whether the observed improvements reliably exceed the variance already present in the SMC baseline.
- [Section 5.1] Section 5.1: the claim that TSMCTS “scales favorably with sequential compute” is supported by wall-clock curves, but the additional sequential depth introduced by the second resampling and weighting stage is not quantified; if the effective depth grows linearly with the second pass, the favorable scaling may not survive when total sequential operations are counted.
minor comments (2)
- [Abstract] The abstract refers to “a popular modern version of MCTS” without naming the specific algorithm (e.g., PUCT with neural value function); the experiments section should state the exact baseline implementation for reproducibility.
- [Section 3] Notation for particle weights and effective sample size is introduced in Section 3 but never compared to standard SMC references (Doucet et al., 2001); adding one sentence linking the definitions would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment below and indicate planned revisions to the manuscript.
read point-by-point responses
-
Referee: [Section 4.2] Section 4.2 and Algorithm 2: the integration of the second SMC pass is described at a high level but does not specify whether the second pass reuses particles from the first pass, resamples independently, or conditions proposals on shared prefixes; without this detail it is impossible to verify that path degeneracy along common tree branches is actually mitigated rather than merely shifted.
Authors: We agree that the current description leaves room for ambiguity. In TSMCTS the second SMC pass reuses the particles from the first pass and conditions new proposals on the shared prefixes of branches already explored in the first pass; resampling then occurs at the identified branch points. This structure is intended to reduce degeneracy along common paths rather than merely relocating it. We will expand the text in Section 4.2 and augment Algorithm 2 with explicit steps and pseudocode to document the reuse and conditioning mechanism. revision: yes
-
Referee: Experimental results, Figure 4 and Table 2: variance-reduction and performance gains are reported, yet the text provides no information on the number of independent runs, random seeds, or statistical tests used to establish significance; this leaves open whether the observed improvements reliably exceed the variance already present in the SMC baseline.
Authors: This is a fair criticism. The manuscript does not currently report the number of runs or statistical procedures. All results were obtained from 10 independent trials per environment using distinct random seeds; error bars in the figures represent standard error of the mean, and pairwise t-tests were used to assess significance of differences versus the SMC baseline. We will add this information to the experimental section and figure captions in the revision. revision: yes
-
Referee: [Section 5.1] Section 5.1: the claim that TSMCTS “scales favorably with sequential compute” is supported by wall-clock curves, but the additional sequential depth introduced by the second resampling and weighting stage is not quantified; if the effective depth grows linearly with the second pass, the favorable scaling may not survive when total sequential operations are counted.
Authors: We acknowledge that a precise accounting of sequential operations would strengthen the scaling claim. The second pass performs resampling and reweighting over the fixed set of particles and tree prefixes produced by the first pass; the number of additional sequential steps is therefore bounded and does not increase linearly with search depth. We will add an explicit quantification of these steps and a short discussion in the revised Section 5.1 to clarify why overall scaling remains favorable. revision: yes
Circularity Check
No circularity: empirical claims rest on experimental comparisons
full rationale
The paper proposes TSMCTS as an algorithmic extension of SMC for tree search to mitigate variance and path degeneracy. All central claims of outperformance, favorable scaling, and retained parallelizability are supported by direct empirical evaluations in discrete and continuous environments rather than any mathematical derivation chain. No equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described content; the construction is presented as a novel double-pass procedure whose benefits are tested externally.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control
EfficientTDMPC extends the TD-MPC family with model ensembles, return averaging, and uncertainty penalties to reach SOTA sample efficiency on hard continuous control benchmarks in low-data regimes.
Reference graph
Works this paper leans on
-
[1]
Guillaume Chaslot, Mark H. M. Winands, and H. Jaap van den Herik. Parallel Monte-Carlo Tree Search. InComputers and Games, CG 2008, volume 5131 ofLecture Notes in Computer Science, pp. 60–71. Springer, Berlin, Heidelberg,
work page 2008
-
[2]
doi: 10.1007/978-3-540-87608-3_6. Nicolas Chopin. Central Limit Theorem for Sequential Monte Carlo Methods and its Ap- plication to Bayesian Inference.The Annals of Statistics, 32(6):2385–2411,
-
[3]
Nicolas Chopin and Omiros Papaspiliopoulos.An Introduction to Sequential Monte Carlo
doi: 10.1214/009053604000000698. Nicolas Chopin and Omiros Papaspiliopoulos.An Introduction to Sequential Monte Carlo. Springer Series in Statistics. Springer, Cham, 1st edition,
-
[4]
Ivo Danihelka, Arthur Guez, Julian Schrittwieser, and David Silver
doi: 10.1007/978-3-030-47845-2. Ivo Danihelka, Arthur Guez, Julian Schrittwieser, and David Silver. Policy improvement by planning with Gumbel. InThe Tenth International Conference on Learning Representations,
-
[5]
doi: 10.2307/2289144. Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mo- hammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grze- gorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning.Nature, 610...
-
[6]
doi: 10.1038/s41586-022-05172-4. C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax - A Differentiable Physics Engine for Large Scale Rigid Body Simulation. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1),
-
[7]
doi: 10.1162/neco.1992.4.1.1. Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, and Remi Munos. Monte-Carlo Tree Search as Regularized Policy Optimization. InProceedings of the 37th International Conference on Machine Learning, volume 119, pp. 3769–3778. PMLR,
-
[8]
Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review
Sergey Levine. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
doi: 10.1038/s41586-023-06004-9. Yaniv Oren, Moritz A Zanger, Pascal R Van der Vaart, Mustafa Mert Çelikok, Matthijs TJ Spaan, and Wendelin Boehmer. Value Improved Actor Critic Algorithms. arXiv:2406.01423,
-
[10]
Proximal Policy Optimization Algorithms
doi: 10.1038/s41586-020-03051-4. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv:1707.06347,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-020-03051-4
-
[11]
doi: 10.1126/science. aar6404. Richard S. Sutton. Learning to predict by the methods of temporal differences.Machine Learning, 3 (1):9–44,
-
[12]
doi: 10.1007/BF00115009. Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. A Bradford Book, 2nd edition,
-
[13]
doi: 10.1007/s10462-022-10228-y. A THEORETICALRESULTS A.1 RL-SMCIS A POLICY IMPROVEMENT OPERATOR Proof. Given exact evaluation Qπ, true environment model P, r, a starting state s0 and infinitely many particles N→ ∞ , the SMC target policy at final step T produces the following distribution 12 Preprint. Under review. over trajectories: p(τT ) =p(s 0, a0, ....
-
[14]
operations. In regards to space complexity, MCTS construct a tree of sizeB, so the space required is of complexityO(B). RL-SMC complexityFor N particles and a depth T , the search budget of RL-SMC totalsN T=B . Assuming that N particles operate in parallel the (sequential) runtime complexity isO(T)≤ O(B)< O(B2) operations. In terms of space, RL-SMC mainta...
work page 2019
-
[15]
Following that, the agent proceeds to gather a additional data of size LB
and the above losses. Following that, the agent proceeds to gather a additional data of size LB. The AdamW optimizer Loshchilov & Hutter (2019) was used with an l2 penalty of 10−6 and a learning rate of 3·10 −3. Gradients were clipped using a max absolute value of 10 and a global norm limit of
work page 2019
-
[16]
Clearly, limiting the search to only the top two actions is strongly detrimental
Performance is summarized as area-under-the-curve (AUC) for the evaluation returns during training normalized across environments with respect to minimum and maximum AUCs observed across agents and seeds. Clearly, limiting the search to only the top two actions is strongly detrimental. On the other hand the confidence bounds for m1 = 4,16 almost entirely ...
work page 2025
-
[17]
MCTS is extended to the continuous-action domain using Hubert et al
Ant, Halfcheetah and Humanoid. MCTS is extended to the continuous-action domain using Hubert et al. (2021)’s SampledMZ approach. ComputeAll experiments were run on the Delft AI Cluster (DAIC) (2024) cluster with a mix of [anonymized for review] Tesla V100-SXM2 32GB, NVIDIA A40 48GB, and A100 80GB GPU cards. Each individual run (seed) used 2 CPU cores and≤...
work page 2021
-
[18]
=P a∈M πimproved(a|s)Qsearch(s, a) at the end of training as a function of depth for each method. M is the number of actions over which the estimator maintains information (susceptible to path degeneracy). Following de Vries et al. (2025), for TRT-SMC and the SMC baseline we compute Qsearch as the TD-λ estimator for each particle at the root for thelastde...
work page 2025
-
[19]
(ii) The βroot inverse-temperature hyperparameter ofIGM Z used by T/SMCTS to compute the improved policy at the root (Iroot). For βroot we conducted a grid search with a small number of seeds across environments and values of 0.1−1,0.05 −1,0.01 −1,0.005 −1. β= 0.01 −1 was overall the best performer. The βsearch hyperparameter is actually the same paramete...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.