pith. sign in

arxiv: 2511.14220 · v3 · pith:KURCTIWYnew · submitted 2025-11-18 · 💻 cs.LG · cs.AI

Twice Sequential Monte Carlo for Tree Search

Pith reviewed 2026-05-22 12:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords sequential monte carlomonte carlo tree searchreinforcement learningtree searchpolicy improvementvariance reductionpath degeneracy
0
0 comments X

The pith

Running Sequential Monte Carlo twice in sequence reduces variance and path degeneracy in tree search for reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Twice Sequential Monte Carlo Tree Search to fix limitations in using SMC for tree search in model-based RL. Standard SMC suffers from high variance and path degeneracy that worsen with deeper searches, even though it parallelizes more easily than MCTS. By executing the SMC procedure a second time on the output of the first, the method produces lower-variance estimates and fewer collapsed particle paths. Experiments across discrete and continuous settings show this version outperforms both plain SMC and a modern MCTS variant when used for policy improvement, while still scaling with extra sequential compute and preserving parallelization advantages.

Core claim

TSMCTS applies Sequential Monte Carlo twice in sequence as a tree-search operator; the second pass operates on the particles produced by the first, which the authors show lowers estimator variance, lessens path degeneracy, and yields stronger policy improvement than either single-pass SMC or a popular MCTS implementation, all while retaining SMC's suitability for parallel hardware.

What carries the argument

Twice Sequential Monte Carlo Tree Search (TSMCTS), a two-stage particle-filter procedure in which the second SMC run refines the weighted particles from the first run to produce the final action-value estimates.

If this is right

  • TSMCTS serves as a stronger policy-improvement operator than either SMC or modern MCTS in both discrete and continuous domains.
  • The method scales favorably when more sequential compute is allocated to deeper searches.
  • Estimator variance drops and the impact of path degeneracy is reduced while parallelization properties of SMC are kept intact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Stacking additional SMC stages might be explored as a general pattern for combating degeneracy in other particle-based planning algorithms.
  • The same double-pass idea could be tested as a drop-in replacement inside existing model-based RL pipelines that already use SMC.
  • If the variance reduction holds, TSMCTS might allow reliable search at depths where single SMC previously became unusable.

Load-bearing premise

That a second SMC pass will reliably cut variance and path degeneracy enough to improve policy quality without adding prohibitive extra cost or new biases in the tested environments.

What would settle it

In controlled experiments with increased search depth, if TSMCTS shows no reduction in estimator variance or no gain in policy performance over single SMC, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2511.14220 by Joery A. de Vries, Matthijs T. J. Spaan, Pascal R. van der Vaart, Wendelin B\"ohmer, Yaniv Oren.

Figure 1
Figure 1. Figure 1: Averaged returns vs. environment interactions. Mean and 90% two-sided BCa-bootstrap [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Averaged returns vs. runtime (seconds). Mean and 90% two-sided BCa-bootstrap intervals [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Left: Performance scaling with depth (higher is better), averaged across environments and particle budgets of 4, 8, 16. 10 seeds and 90% two-sided BCa-bootstrap intervals. Center: Variance of the root estimator vs. depth (lower is better). Right: The number of actions active in the policy target (constant - no target degeneracy - better). Center and right are averaged across states and particle budgets 4, … view at source ↗
Figure 4
Figure 4. Figure 4: Performance scaling with depth (higher is better, increasing is better). Averaged across [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗
read the original abstract

Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS as a policy improvement operator, scales favorably with sequential compute, reduces estimator variance and mitigates the effects of path degeneracy while retaining the properties that make SMC natural to parallelize.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Twice Sequential Monte Carlo Tree Search (TSMCTS), which applies Sequential Monte Carlo twice within a tree-search framework for model-based reinforcement learning. The central claim is that this double-pass construction outperforms a standard SMC baseline and a modern MCTS variant as a policy-improvement operator across discrete and continuous environments, scales favorably with sequential compute, reduces estimator variance, and mitigates path degeneracy while preserving SMC’s parallelizability.

Significance. If the empirical claims hold, the work is significant because it directly targets the scaling limitations that have hindered SMC adoption relative to MCTS in deep search settings. Retaining GPU-friendly parallelism while demonstrating improved policy improvement and variance reduction could influence practical implementations of search-based RL. The evaluation across environment classes and the focus on sequential-compute scaling are strengths that merit attention.

major comments (3)
  1. [Section 4.2] Section 4.2 and Algorithm 2: the integration of the second SMC pass is described at a high level but does not specify whether the second pass reuses particles from the first pass, resamples independently, or conditions proposals on shared prefixes; without this detail it is impossible to verify that path degeneracy along common tree branches is actually mitigated rather than merely shifted.
  2. [Experimental results] Experimental results, Figure 4 and Table 2: variance-reduction and performance gains are reported, yet the text provides no information on the number of independent runs, random seeds, or statistical tests used to establish significance; this leaves open whether the observed improvements reliably exceed the variance already present in the SMC baseline.
  3. [Section 5.1] Section 5.1: the claim that TSMCTS “scales favorably with sequential compute” is supported by wall-clock curves, but the additional sequential depth introduced by the second resampling and weighting stage is not quantified; if the effective depth grows linearly with the second pass, the favorable scaling may not survive when total sequential operations are counted.
minor comments (2)
  1. [Abstract] The abstract refers to “a popular modern version of MCTS” without naming the specific algorithm (e.g., PUCT with neural value function); the experiments section should state the exact baseline implementation for reproducibility.
  2. [Section 3] Notation for particle weights and effective sample size is introduced in Section 3 but never compared to standard SMC references (Doucet et al., 2001); adding one sentence linking the definitions would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and indicate planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Section 4.2] Section 4.2 and Algorithm 2: the integration of the second SMC pass is described at a high level but does not specify whether the second pass reuses particles from the first pass, resamples independently, or conditions proposals on shared prefixes; without this detail it is impossible to verify that path degeneracy along common tree branches is actually mitigated rather than merely shifted.

    Authors: We agree that the current description leaves room for ambiguity. In TSMCTS the second SMC pass reuses the particles from the first pass and conditions new proposals on the shared prefixes of branches already explored in the first pass; resampling then occurs at the identified branch points. This structure is intended to reduce degeneracy along common paths rather than merely relocating it. We will expand the text in Section 4.2 and augment Algorithm 2 with explicit steps and pseudocode to document the reuse and conditioning mechanism. revision: yes

  2. Referee: Experimental results, Figure 4 and Table 2: variance-reduction and performance gains are reported, yet the text provides no information on the number of independent runs, random seeds, or statistical tests used to establish significance; this leaves open whether the observed improvements reliably exceed the variance already present in the SMC baseline.

    Authors: This is a fair criticism. The manuscript does not currently report the number of runs or statistical procedures. All results were obtained from 10 independent trials per environment using distinct random seeds; error bars in the figures represent standard error of the mean, and pairwise t-tests were used to assess significance of differences versus the SMC baseline. We will add this information to the experimental section and figure captions in the revision. revision: yes

  3. Referee: [Section 5.1] Section 5.1: the claim that TSMCTS “scales favorably with sequential compute” is supported by wall-clock curves, but the additional sequential depth introduced by the second resampling and weighting stage is not quantified; if the effective depth grows linearly with the second pass, the favorable scaling may not survive when total sequential operations are counted.

    Authors: We acknowledge that a precise accounting of sequential operations would strengthen the scaling claim. The second pass performs resampling and reweighting over the fixed set of particles and tree prefixes produced by the first pass; the number of additional sequential steps is therefore bounded and does not increase linearly with search depth. We will add an explicit quantification of these steps and a short discussion in the revised Section 5.1 to clarify why overall scaling remains favorable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experimental comparisons

full rationale

The paper proposes TSMCTS as an algorithmic extension of SMC for tree search to mitigate variance and path degeneracy. All central claims of outperformance, favorable scaling, and retained parallelizability are supported by direct empirical evaluations in discrete and continuous environments rather than any mathematical derivation chain. No equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described content; the construction is presented as a novel double-pass procedure whose benefits are tested externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities are extractable from the provided text.

pith-pipeline@v0.9.0 · 5687 in / 1099 out tokens · 110190 ms · 2026-05-22T12:58:43.874024+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control

    cs.LG 2026-05 unverdicted novelty 4.0

    EfficientTDMPC extends the TD-MPC family with model ensembles, return averaging, and uncertainty penalties to reach SOTA sample efficiency on hard continuous control benchmarks in low-data regimes.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Guillaume Chaslot, Mark H. M. Winands, and H. Jaap van den Herik. Parallel Monte-Carlo Tree Search. InComputers and Games, CG 2008, volume 5131 ofLecture Notes in Computer Science, pp. 60–71. Springer, Berlin, Heidelberg,

  2. [2]

    Nicolas Chopin

    doi: 10.1007/978-3-540-87608-3_6. Nicolas Chopin. Central Limit Theorem for Sequential Monte Carlo Methods and its Ap- plication to Bayesian Inference.The Annals of Statistics, 32(6):2385–2411,

  3. [3]

    Nicolas Chopin and Omiros Papaspiliopoulos.An Introduction to Sequential Monte Carlo

    doi: 10.1214/009053604000000698. Nicolas Chopin and Omiros Papaspiliopoulos.An Introduction to Sequential Monte Carlo. Springer Series in Statistics. Springer, Cham, 1st edition,

  4. [4]

    Ivo Danihelka, Arthur Guez, Julian Schrittwieser, and David Silver

    doi: 10.1007/978-3-030-47845-2. Ivo Danihelka, Arthur Guez, Julian Schrittwieser, and David Silver. Policy improvement by planning with Gumbel. InThe Tenth International Conference on Learning Representations,

  5. [5]

    Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mo- hammadamin Barekatain, Alexander Novikov, Francisco J

    doi: 10.2307/2289144. Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mo- hammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grze- gorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning.Nature, 610...

  6. [6]

    doi: 10.1038/s41586-022-05172-4. C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax - A Differentiable Physics Engine for Large Scale Rigid Body Simulation. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1),

  7. [7]

    Geman, E

    doi: 10.1162/neco.1992.4.1.1. Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, and Remi Munos. Monte-Carlo Tree Search as Regularized Policy Optimization. InProceedings of the 37th International Conference on Machine Learning, volume 119, pp. 3769–3778. PMLR,

  8. [8]

    Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

    Sergey Levine. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909,

  9. [9]

    Yaniv Oren, Moritz A Zanger, Pascal R Van der Vaart, Mustafa Mert Çelikok, Matthijs TJ Spaan, and Wendelin Boehmer

    doi: 10.1038/s41586-023-06004-9. Yaniv Oren, Moritz A Zanger, Pascal R Van der Vaart, Mustafa Mert Çelikok, Matthijs TJ Spaan, and Wendelin Boehmer. Value Improved Actor Critic Algorithms. arXiv:2406.01423,

  10. [10]

    Proximal Policy Optimization Algorithms

    doi: 10.1038/s41586-020-03051-4. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv:1707.06347,

  11. [11]

    doi: 10.1126/science. aar6404. Richard S. Sutton. Learning to predict by the methods of temporal differences.Machine Learning, 3 (1):9–44,

  12. [12]

    Richard S

    doi: 10.1007/BF00115009. Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. A Bradford Book, 2nd edition,

  13. [13]

    less improved

    doi: 10.1007/s10462-022-10228-y. A THEORETICALRESULTS A.1 RL-SMCIS A POLICY IMPROVEMENT OPERATOR Proof. Given exact evaluation Qπ, true environment model P, r, a starting state s0 and infinitely many particles N→ ∞ , the SMC target policy at final step T produces the following distribution 12 Preprint. Under review. over trajectories: p(τT ) =p(s 0, a0, ....

  14. [14]

    In regards to space complexity, MCTS construct a tree of sizeB, so the space required is of complexityO(B)

    operations. In regards to space complexity, MCTS construct a tree of sizeB, so the space required is of complexityO(B). RL-SMC complexityFor N particles and a depth T , the search budget of RL-SMC totalsN T=B . Assuming that N particles operate in parallel the (sequential) runtime complexity isO(T)≤ O(B)< O(B2) operations. In terms of space, RL-SMC mainta...

  15. [15]

    Following that, the agent proceeds to gather a additional data of size LB

    and the above losses. Following that, the agent proceeds to gather a additional data of size LB. The AdamW optimizer Loshchilov & Hutter (2019) was used with an l2 penalty of 10−6 and a learning rate of 3·10 −3. Gradients were clipped using a max absolute value of 10 and a global norm limit of

  16. [16]

    Clearly, limiting the search to only the top two actions is strongly detrimental

    Performance is summarized as area-under-the-curve (AUC) for the evaluation returns during training normalized across environments with respect to minimum and maximum AUCs observed across agents and seeds. Clearly, limiting the search to only the top two actions is strongly detrimental. On the other hand the confidence bounds for m1 = 4,16 almost entirely ...

  17. [17]

    MCTS is extended to the continuous-action domain using Hubert et al

    Ant, Halfcheetah and Humanoid. MCTS is extended to the continuous-action domain using Hubert et al. (2021)’s SampledMZ approach. ComputeAll experiments were run on the Delft AI Cluster (DAIC) (2024) cluster with a mix of [anonymized for review] Tesla V100-SXM2 32GB, NVIDIA A40 48GB, and A100 80GB GPU cards. Each individual run (seed) used 2 CPU cores and≤...

  18. [18]

    M is the number of actions over which the estimator maintains information (susceptible to path degeneracy)

    =P a∈M πimproved(a|s)Qsearch(s, a) at the end of training as a function of depth for each method. M is the number of actions over which the estimator maintains information (susceptible to path degeneracy). Following de Vries et al. (2025), for TRT-SMC and the SMC baseline we compute Qsearch as the TD-λ estimator for each particle at the root for thelastde...

  19. [19]

    For βroot we conducted a grid search with a small number of seeds across environments and values of 0.1−1,0.05 −1,0.01 −1,0.005 −1

    (ii) The βroot inverse-temperature hyperparameter ofIGM Z used by T/SMCTS to compute the improved policy at the root (Iroot). For βroot we conducted a grid search with a small number of seeds across environments and values of 0.1−1,0.05 −1,0.01 −1,0.005 −1. β= 0.01 −1 was overall the best performer. The βsearch hyperparameter is actually the same paramete...