Twice Sequential Monte Carlo for Tree Search

Joery A. de Vries; Matthijs T. J. Spaan; Pascal R. van der Vaart; Wendelin B\"ohmer; Yaniv Oren

arxiv: 2511.14220 · v3 · pith:KURCTIWYnew · submitted 2025-11-18 · 💻 cs.LG · cs.AI

Twice Sequential Monte Carlo for Tree Search

Yaniv Oren , Joery A. de Vries , Pascal R. van der Vaart , Matthijs T. J. Spaan , Wendelin B\"ohmer This is my paper

Pith reviewed 2026-05-22 12:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords sequential monte carlomonte carlo tree searchreinforcement learningtree searchpolicy improvementvariance reductionpath degeneracy

0 comments

The pith

Running Sequential Monte Carlo twice in sequence reduces variance and path degeneracy in tree search for reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Twice Sequential Monte Carlo Tree Search to fix limitations in using SMC for tree search in model-based RL. Standard SMC suffers from high variance and path degeneracy that worsen with deeper searches, even though it parallelizes more easily than MCTS. By executing the SMC procedure a second time on the output of the first, the method produces lower-variance estimates and fewer collapsed particle paths. Experiments across discrete and continuous settings show this version outperforms both plain SMC and a modern MCTS variant when used for policy improvement, while still scaling with extra sequential compute and preserving parallelization advantages.

Core claim

TSMCTS applies Sequential Monte Carlo twice in sequence as a tree-search operator; the second pass operates on the particles produced by the first, which the authors show lowers estimator variance, lessens path degeneracy, and yields stronger policy improvement than either single-pass SMC or a popular MCTS implementation, all while retaining SMC's suitability for parallel hardware.

What carries the argument

Twice Sequential Monte Carlo Tree Search (TSMCTS), a two-stage particle-filter procedure in which the second SMC run refines the weighted particles from the first run to produce the final action-value estimates.

If this is right

TSMCTS serves as a stronger policy-improvement operator than either SMC or modern MCTS in both discrete and continuous domains.
The method scales favorably when more sequential compute is allocated to deeper searches.
Estimator variance drops and the impact of path degeneracy is reduced while parallelization properties of SMC are kept intact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Stacking additional SMC stages might be explored as a general pattern for combating degeneracy in other particle-based planning algorithms.
The same double-pass idea could be tested as a drop-in replacement inside existing model-based RL pipelines that already use SMC.
If the variance reduction holds, TSMCTS might allow reliable search at depths where single SMC previously became unusable.

Load-bearing premise

That a second SMC pass will reliably cut variance and path degeneracy enough to improve policy quality without adding prohibitive extra cost or new biases in the tested environments.

What would settle it

In controlled experiments with increased search depth, if TSMCTS shows no reduction in estimator variance or no gain in policy performance over single SMC, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2511.14220 by Joery A. de Vries, Matthijs T. J. Spaan, Pascal R. van der Vaart, Wendelin B\"ohmer, Yaniv Oren.

**Figure 2.** Figure 2: Averaged returns vs. runtime (seconds). Mean and 90% two-sided BCa-bootstrap intervals [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Left: Performance scaling with depth (higher is better), averaged across environments and particle budgets of 4, 8, 16. 10 seeds and 90% two-sided BCa-bootstrap intervals. Center: Variance of the root estimator vs. depth (lower is better). Right: The number of actions active in the policy target (constant - no target degeneracy - better). Center and right are averaged across states and particle budgets 4, … view at source ↗

**Figure 4.** Figure 4: Performance scaling with depth (higher is better, increasing is better). Averaged across [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

read the original abstract

Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS as a policy improvement operator, scales favorably with sequential compute, reduces estimator variance and mitigates the effects of path degeneracy while retaining the properties that make SMC natural to parallelize.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TSMCTS runs SMC twice on tree search to cut variance and path degeneracy while keeping GPU-friendly parallelism, with reported gains over plain SMC and MCTS on discrete and continuous tasks.

read the letter

The core move is applying Sequential Monte Carlo twice inside the tree search loop. The first pass builds particles as usual; the second pass is meant to correct the weight collapse and shared-prefix degeneracy that normally limit depth. They report that this version beats both a standard SMC baseline and a current MCTS variant when used as a policy-improvement operator, shows lower estimator variance, and still scales with added sequential compute without losing the parallel advantages of SMC over classic MCTS. That combination is the practical claim worth checking first. The construction itself looks like a modest but targeted change rather than a wholesale redesign of the search operator. If the second pass really decorrelates particles across branches without reintroducing bias or extra sequential depth, it could be a useful drop-in upgrade for model-based RL pipelines that already favor SMC for hardware reasons. The experiments apparently cover both discrete and continuous environments, which is a reasonable start for showing generality. The main soft spot is the lack of visible detail on exactly how the two SMC stages are wired together. Path degeneracy is driven by common prefixes, so a second independent or rejuvenated pass could leave residual correlations or require extra resampling steps that quietly increase effective sequential cost. Without the equations or pseudocode it is hard to judge whether the method avoids those traps or simply masks them in the chosen testbeds. The abstract also does not mention ablations on the second-pass hyperparameters or statistical significance tests, so the variance-reduction numbers need direct verification. This paper is aimed at people working on planning and search inside model-based RL who already know SMC and want something that scales a bit further on parallel hardware. A reader who cares about practical policy-improvement operators will get the most out of it. The work is coherent on its own terms and the claims are falsifiable, so it deserves a serious referee even if the final verdict depends on the full experimental section and implementation details.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Twice Sequential Monte Carlo Tree Search (TSMCTS), which applies Sequential Monte Carlo twice within a tree-search framework for model-based reinforcement learning. The central claim is that this double-pass construction outperforms a standard SMC baseline and a modern MCTS variant as a policy-improvement operator across discrete and continuous environments, scales favorably with sequential compute, reduces estimator variance, and mitigates path degeneracy while preserving SMC’s parallelizability.

Significance. If the empirical claims hold, the work is significant because it directly targets the scaling limitations that have hindered SMC adoption relative to MCTS in deep search settings. Retaining GPU-friendly parallelism while demonstrating improved policy improvement and variance reduction could influence practical implementations of search-based RL. The evaluation across environment classes and the focus on sequential-compute scaling are strengths that merit attention.

major comments (3)

[Section 4.2] Section 4.2 and Algorithm 2: the integration of the second SMC pass is described at a high level but does not specify whether the second pass reuses particles from the first pass, resamples independently, or conditions proposals on shared prefixes; without this detail it is impossible to verify that path degeneracy along common tree branches is actually mitigated rather than merely shifted.
[Experimental results] Experimental results, Figure 4 and Table 2: variance-reduction and performance gains are reported, yet the text provides no information on the number of independent runs, random seeds, or statistical tests used to establish significance; this leaves open whether the observed improvements reliably exceed the variance already present in the SMC baseline.
[Section 5.1] Section 5.1: the claim that TSMCTS “scales favorably with sequential compute” is supported by wall-clock curves, but the additional sequential depth introduced by the second resampling and weighting stage is not quantified; if the effective depth grows linearly with the second pass, the favorable scaling may not survive when total sequential operations are counted.

minor comments (2)

[Abstract] The abstract refers to “a popular modern version of MCTS” without naming the specific algorithm (e.g., PUCT with neural value function); the experiments section should state the exact baseline implementation for reproducibility.
[Section 3] Notation for particle weights and effective sample size is introduced in Section 3 but never compared to standard SMC references (Doucet et al., 2001); adding one sentence linking the definitions would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and indicate planned revisions to the manuscript.

read point-by-point responses

Referee: [Section 4.2] Section 4.2 and Algorithm 2: the integration of the second SMC pass is described at a high level but does not specify whether the second pass reuses particles from the first pass, resamples independently, or conditions proposals on shared prefixes; without this detail it is impossible to verify that path degeneracy along common tree branches is actually mitigated rather than merely shifted.

Authors: We agree that the current description leaves room for ambiguity. In TSMCTS the second SMC pass reuses the particles from the first pass and conditions new proposals on the shared prefixes of branches already explored in the first pass; resampling then occurs at the identified branch points. This structure is intended to reduce degeneracy along common paths rather than merely relocating it. We will expand the text in Section 4.2 and augment Algorithm 2 with explicit steps and pseudocode to document the reuse and conditioning mechanism. revision: yes
Referee: Experimental results, Figure 4 and Table 2: variance-reduction and performance gains are reported, yet the text provides no information on the number of independent runs, random seeds, or statistical tests used to establish significance; this leaves open whether the observed improvements reliably exceed the variance already present in the SMC baseline.

Authors: This is a fair criticism. The manuscript does not currently report the number of runs or statistical procedures. All results were obtained from 10 independent trials per environment using distinct random seeds; error bars in the figures represent standard error of the mean, and pairwise t-tests were used to assess significance of differences versus the SMC baseline. We will add this information to the experimental section and figure captions in the revision. revision: yes
Referee: [Section 5.1] Section 5.1: the claim that TSMCTS “scales favorably with sequential compute” is supported by wall-clock curves, but the additional sequential depth introduced by the second resampling and weighting stage is not quantified; if the effective depth grows linearly with the second pass, the favorable scaling may not survive when total sequential operations are counted.

Authors: We acknowledge that a precise accounting of sequential operations would strengthen the scaling claim. The second pass performs resampling and reweighting over the fixed set of particles and tree prefixes produced by the first pass; the number of additional sequential steps is therefore bounded and does not increase linearly with search depth. We will add an explicit quantification of these steps and a short discussion in the revised Section 5.1 to clarify why overall scaling remains favorable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experimental comparisons

full rationale

The paper proposes TSMCTS as an algorithmic extension of SMC for tree search to mitigate variance and path degeneracy. All central claims of outperformance, favorable scaling, and retained parallelizability are supported by direct empirical evaluations in discrete and continuous environments rather than any mathematical derivation chain. No equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described content; the construction is presented as a novel double-pass procedure whose benefits are tested externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities are extractable from the provided text.

pith-pipeline@v0.9.0 · 5687 in / 1099 out tokens · 110190 ms · 2026-05-22T12:58:43.874024+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control
cs.LG 2026-05 unverdicted novelty 4.0

EfficientTDMPC extends the TD-MPC family with model ensembles, return averaging, and uncertainty penalties to reach SOTA sample efficiency on hard continuous control benchmarks in low-data regimes.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Guillaume Chaslot, Mark H. M. Winands, and H. Jaap van den Herik. Parallel Monte-Carlo Tree Search. InComputers and Games, CG 2008, volume 5131 ofLecture Notes in Computer Science, pp. 60–71. Springer, Berlin, Heidelberg,

work page 2008
[2]

Nicolas Chopin

doi: 10.1007/978-3-540-87608-3_6. Nicolas Chopin. Central Limit Theorem for Sequential Monte Carlo Methods and its Ap- plication to Bayesian Inference.The Annals of Statistics, 32(6):2385–2411,

work page doi:10.1007/978-3-540-87608-3_6
[3]

Nicolas Chopin and Omiros Papaspiliopoulos.An Introduction to Sequential Monte Carlo

doi: 10.1214/009053604000000698. Nicolas Chopin and Omiros Papaspiliopoulos.An Introduction to Sequential Monte Carlo. Springer Series in Statistics. Springer, Cham, 1st edition,

work page doi:10.1214/009053604000000698
[4]

Ivo Danihelka, Arthur Guez, Julian Schrittwieser, and David Silver

doi: 10.1007/978-3-030-47845-2. Ivo Danihelka, Arthur Guez, Julian Schrittwieser, and David Silver. Policy improvement by planning with Gumbel. InThe Tenth International Conference on Learning Representations,

work page doi:10.1007/978-3-030-47845-2
[5]

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mo- hammadamin Barekatain, Alexander Novikov, Francisco J

doi: 10.2307/2289144. Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mo- hammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grze- gorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning.Nature, 610...

work page doi:10.2307/2289144
[6]

doi: 10.1038/s41586-022-05172-4. C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax - A Differentiable Physics Engine for Large Scale Rigid Body Simulation. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1),

work page doi:10.1038/s41586-022-05172-4
[7]

Geman, E

doi: 10.1162/neco.1992.4.1.1. Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, and Remi Munos. Monte-Carlo Tree Search as Regularized Policy Optimization. InProceedings of the 37th International Conference on Machine Learning, volume 119, pp. 3769–3778. PMLR,

work page doi:10.1162/neco.1992.4.1.1 1992
[8]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Yaniv Oren, Moritz A Zanger, Pascal R Van der Vaart, Mustafa Mert Çelikok, Matthijs TJ Spaan, and Wendelin Boehmer

doi: 10.1038/s41586-023-06004-9. Yaniv Oren, Moritz A Zanger, Pascal R Van der Vaart, Mustafa Mert Çelikok, Matthijs TJ Spaan, and Wendelin Boehmer. Value Improved Actor Critic Algorithms. arXiv:2406.01423,

work page doi:10.1038/s41586-023-06004-9
[10]

Proximal Policy Optimization Algorithms

doi: 10.1038/s41586-020-03051-4. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-020-03051-4
[11]

doi: 10.1126/science. aar6404. Richard S. Sutton. Learning to predict by the methods of temporal differences.Machine Learning, 3 (1):9–44,

work page doi:10.1126/science
[12]

Richard S

doi: 10.1007/BF00115009. Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. A Bradford Book, 2nd edition,

work page doi:10.1007/bf00115009
[13]

less improved

doi: 10.1007/s10462-022-10228-y. A THEORETICALRESULTS A.1 RL-SMCIS A POLICY IMPROVEMENT OPERATOR Proof. Given exact evaluation Qπ, true environment model P, r, a starting state s0 and infinitely many particles N→ ∞ , the SMC target policy at final step T produces the following distribution 12 Preprint. Under review. over trajectories: p(τT ) =p(s 0, a0, ....

work page doi:10.1007/s10462-022-10228-y 2004
[14]

In regards to space complexity, MCTS construct a tree of sizeB, so the space required is of complexityO(B)

operations. In regards to space complexity, MCTS construct a tree of sizeB, so the space required is of complexityO(B). RL-SMC complexityFor N particles and a depth T , the search budget of RL-SMC totalsN T=B . Assuming that N particles operate in parallel the (sequential) runtime complexity isO(T)≤ O(B)< O(B2) operations. In terms of space, RL-SMC mainta...

work page 2019
[15]

Following that, the agent proceeds to gather a additional data of size LB

and the above losses. Following that, the agent proceeds to gather a additional data of size LB. The AdamW optimizer Loshchilov & Hutter (2019) was used with an l2 penalty of 10−6 and a learning rate of 3·10 −3. Gradients were clipped using a max absolute value of 10 and a global norm limit of

work page 2019
[16]

Clearly, limiting the search to only the top two actions is strongly detrimental

Performance is summarized as area-under-the-curve (AUC) for the evaluation returns during training normalized across environments with respect to minimum and maximum AUCs observed across agents and seeds. Clearly, limiting the search to only the top two actions is strongly detrimental. On the other hand the confidence bounds for m1 = 4,16 almost entirely ...

work page 2025
[17]

MCTS is extended to the continuous-action domain using Hubert et al

Ant, Halfcheetah and Humanoid. MCTS is extended to the continuous-action domain using Hubert et al. (2021)’s SampledMZ approach. ComputeAll experiments were run on the Delft AI Cluster (DAIC) (2024) cluster with a mix of [anonymized for review] Tesla V100-SXM2 32GB, NVIDIA A40 48GB, and A100 80GB GPU cards. Each individual run (seed) used 2 CPU cores and≤...

work page 2021
[18]

M is the number of actions over which the estimator maintains information (susceptible to path degeneracy)

=P a∈M πimproved(a|s)Qsearch(s, a) at the end of training as a function of depth for each method. M is the number of actions over which the estimator maintains information (susceptible to path degeneracy). Following de Vries et al. (2025), for TRT-SMC and the SMC baseline we compute Qsearch as the TD-λ estimator for each particle at the root for thelastde...

work page 2025
[19]

For βroot we conducted a grid search with a small number of seeds across environments and values of 0.1−1,0.05 −1,0.01 −1,0.005 −1

(ii) The βroot inverse-temperature hyperparameter ofIGM Z used by T/SMCTS to compute the improved policy at the root (Iroot). For βroot we conducted a grid search with a small number of seeds across environments and values of 0.1−1,0.05 −1,0.01 −1,0.005 −1. β= 0.01 −1 was overall the best performer. The βsearch hyperparameter is actually the same paramete...

work page 2025

[1] [1]

Guillaume Chaslot, Mark H. M. Winands, and H. Jaap van den Herik. Parallel Monte-Carlo Tree Search. InComputers and Games, CG 2008, volume 5131 ofLecture Notes in Computer Science, pp. 60–71. Springer, Berlin, Heidelberg,

work page 2008

[2] [2]

Nicolas Chopin

doi: 10.1007/978-3-540-87608-3_6. Nicolas Chopin. Central Limit Theorem for Sequential Monte Carlo Methods and its Ap- plication to Bayesian Inference.The Annals of Statistics, 32(6):2385–2411,

work page doi:10.1007/978-3-540-87608-3_6

[3] [3]

Nicolas Chopin and Omiros Papaspiliopoulos.An Introduction to Sequential Monte Carlo

doi: 10.1214/009053604000000698. Nicolas Chopin and Omiros Papaspiliopoulos.An Introduction to Sequential Monte Carlo. Springer Series in Statistics. Springer, Cham, 1st edition,

work page doi:10.1214/009053604000000698

[4] [4]

Ivo Danihelka, Arthur Guez, Julian Schrittwieser, and David Silver

doi: 10.1007/978-3-030-47845-2. Ivo Danihelka, Arthur Guez, Julian Schrittwieser, and David Silver. Policy improvement by planning with Gumbel. InThe Tenth International Conference on Learning Representations,

work page doi:10.1007/978-3-030-47845-2

[5] [5]

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mo- hammadamin Barekatain, Alexander Novikov, Francisco J

doi: 10.2307/2289144. Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mo- hammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grze- gorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning.Nature, 610...

work page doi:10.2307/2289144

[6] [6]

doi: 10.1038/s41586-022-05172-4. C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax - A Differentiable Physics Engine for Large Scale Rigid Body Simulation. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1),

work page doi:10.1038/s41586-022-05172-4

[7] [7]

Geman, E

doi: 10.1162/neco.1992.4.1.1. Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, and Remi Munos. Monte-Carlo Tree Search as Regularized Policy Optimization. InProceedings of the 37th International Conference on Machine Learning, volume 119, pp. 3769–3778. PMLR,

work page doi:10.1162/neco.1992.4.1.1 1992

[8] [8]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Yaniv Oren, Moritz A Zanger, Pascal R Van der Vaart, Mustafa Mert Çelikok, Matthijs TJ Spaan, and Wendelin Boehmer

doi: 10.1038/s41586-023-06004-9. Yaniv Oren, Moritz A Zanger, Pascal R Van der Vaart, Mustafa Mert Çelikok, Matthijs TJ Spaan, and Wendelin Boehmer. Value Improved Actor Critic Algorithms. arXiv:2406.01423,

work page doi:10.1038/s41586-023-06004-9

[10] [10]

Proximal Policy Optimization Algorithms

doi: 10.1038/s41586-020-03051-4. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-020-03051-4

[11] [11]

doi: 10.1126/science. aar6404. Richard S. Sutton. Learning to predict by the methods of temporal differences.Machine Learning, 3 (1):9–44,

work page doi:10.1126/science

[12] [12]

Richard S

doi: 10.1007/BF00115009. Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. A Bradford Book, 2nd edition,

work page doi:10.1007/bf00115009

[13] [13]

less improved

doi: 10.1007/s10462-022-10228-y. A THEORETICALRESULTS A.1 RL-SMCIS A POLICY IMPROVEMENT OPERATOR Proof. Given exact evaluation Qπ, true environment model P, r, a starting state s0 and infinitely many particles N→ ∞ , the SMC target policy at final step T produces the following distribution 12 Preprint. Under review. over trajectories: p(τT ) =p(s 0, a0, ....

work page doi:10.1007/s10462-022-10228-y 2004

[14] [14]

In regards to space complexity, MCTS construct a tree of sizeB, so the space required is of complexityO(B)

operations. In regards to space complexity, MCTS construct a tree of sizeB, so the space required is of complexityO(B). RL-SMC complexityFor N particles and a depth T , the search budget of RL-SMC totalsN T=B . Assuming that N particles operate in parallel the (sequential) runtime complexity isO(T)≤ O(B)< O(B2) operations. In terms of space, RL-SMC mainta...

work page 2019

[15] [15]

Following that, the agent proceeds to gather a additional data of size LB

and the above losses. Following that, the agent proceeds to gather a additional data of size LB. The AdamW optimizer Loshchilov & Hutter (2019) was used with an l2 penalty of 10−6 and a learning rate of 3·10 −3. Gradients were clipped using a max absolute value of 10 and a global norm limit of

work page 2019

[16] [16]

Clearly, limiting the search to only the top two actions is strongly detrimental

Performance is summarized as area-under-the-curve (AUC) for the evaluation returns during training normalized across environments with respect to minimum and maximum AUCs observed across agents and seeds. Clearly, limiting the search to only the top two actions is strongly detrimental. On the other hand the confidence bounds for m1 = 4,16 almost entirely ...

work page 2025

[17] [17]

MCTS is extended to the continuous-action domain using Hubert et al

Ant, Halfcheetah and Humanoid. MCTS is extended to the continuous-action domain using Hubert et al. (2021)’s SampledMZ approach. ComputeAll experiments were run on the Delft AI Cluster (DAIC) (2024) cluster with a mix of [anonymized for review] Tesla V100-SXM2 32GB, NVIDIA A40 48GB, and A100 80GB GPU cards. Each individual run (seed) used 2 CPU cores and≤...

work page 2021

[18] [18]

M is the number of actions over which the estimator maintains information (susceptible to path degeneracy)

=P a∈M πimproved(a|s)Qsearch(s, a) at the end of training as a function of depth for each method. M is the number of actions over which the estimator maintains information (susceptible to path degeneracy). Following de Vries et al. (2025), for TRT-SMC and the SMC baseline we compute Qsearch as the TD-λ estimator for each particle at the root for thelastde...

work page 2025

[19] [19]

For βroot we conducted a grid search with a small number of seeds across environments and values of 0.1−1,0.05 −1,0.01 −1,0.005 −1

(ii) The βroot inverse-temperature hyperparameter ofIGM Z used by T/SMCTS to compute the improved policy at the root (Iroot). For βroot we conducted a grid search with a small number of seeds across environments and values of 0.1−1,0.05 −1,0.01 −1,0.005 −1. β= 0.01 −1 was overall the best performer. The βsearch hyperparameter is actually the same paramete...

work page 2025