Using Common Random Numbers for Simulation-based Planning with Rollouts
Pith reviewed 2026-05-08 18:24 UTC · model grok-4.3
The pith
Using common random numbers in rollout simulations provably reduces variance in relative utility estimates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the sampling model invokes a rollout policy beyond some depth, sharing common random numbers across the trajectories generated for different actions provably yields relative utility estimates whose variance is no higher than under independent sampling, which improves action selection in the planning loop.
What carries the argument
Common random numbers applied to the sampling model so that trajectories for competing actions share the same randomness and produce correlated utility estimates.
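A minimal sketch of this mechanism, assuming a toy simulator. The names `step`, `rollout_return`, and `estimate_utilities`, the random-walk dynamics, and the seed-per-sample scheme are illustrative assumptions, not the paper's implementation. Sample k for every action is driven by a generator with the same seed, so competing actions face the same noise.

```python
import random

def step(state, action, rng):
    """Hypothetical toy dynamics: the action shifts the state, Gaussian noise perturbs it."""
    next_state = state + action + rng.gauss(0.0, 1.0)
    reward = -abs(next_state)  # reward is highest when the state stays near 0
    return next_state, reward

def rollout_return(state, first_action, rng, horizon=5):
    """Apply first_action, then follow a fixed rollout policy (always action 0.0)."""
    state, total = step(state, first_action, rng)
    for _ in range(horizon - 1):
        state, reward = step(state, 0.0, rng)
        total += reward
    return total

def estimate_utilities(state, actions, n_samples, use_crn, base_seed=0):
    """Return mean rollout returns, one per action, in the order of `actions`."""
    estimates = []
    for idx, action in enumerate(actions):
        returns = []
        for k in range(n_samples):
            # CRN: the seed depends only on the sample index k, so every action
            # sees the same noise stream. Otherwise each action gets its own seeds.
            seed = base_seed + k if use_crn else base_seed + k + 1_000_003 * (idx + 1)
            returns.append(rollout_return(state, action, random.Random(seed)))
        estimates.append(sum(returns) / n_samples)
    return estimates

actions = [-1.0, 0.0, 1.0]
utilities = estimate_utilities(state=0.5, actions=actions, n_samples=200, use_crn=True)
best_action = actions[utilities.index(max(utilities))]
print(best_action, utilities)
```

The streams stay aligned in this sketch only because every step consumes exactly one random draw regardless of the action. A simulator whose draw count depends on the action would likely need a separate stream per noise source to keep the randomness synchronized.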
Load-bearing premise
The sampling model must generate trajectories whose relative utilities can be compared under shared randomness without introducing bias.
What would settle it
An experiment that measures the variance of the difference between two action utilities and finds no reduction (or an increase) when common random numbers replace independent draws in the rollout phase.
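Such an experiment could look roughly like the sketch below. The utility model (a decaying state driven by Gaussian noise, with a quadratic penalty) and the names `utility` and `difference_variance` are hypothetical stand-ins chosen for illustration; the paper's tasks are different. The sketch measures the sample variance of the difference between two action utilities with and without shared random streams.

```python
import random
import statistics

def utility(action, rng, depth=10):
    """Hypothetical toy utility: the action sets the start state, noise drives the rollout."""
    state, total = action, 0.0
    for _ in range(depth):
        state = 0.9 * state + rng.gauss(0.0, 1.0)
        total -= state ** 2  # quadratic penalty for drifting away from 0
    return total

def difference_variance(a_i, a_j, n_pairs, use_crn, seed=0):
    """Sample variance of U_i - U_j over n_pairs simulated pairs."""
    diffs = []
    for k in range(n_pairs):
        rng_i = random.Random(seed + 2 * k)
        # CRN: action j replays the same stream; otherwise it gets its own stream.
        rng_j = random.Random(seed + 2 * k if use_crn else seed + 2 * k + 1)
        diffs.append(utility(a_i, rng_i) - utility(a_j, rng_j))
    return statistics.variance(diffs)

var_indep = difference_variance(1.0, 0.0, 2000, use_crn=False)
var_crn = difference_variance(1.0, 0.0, 2000, use_crn=True)
print(f"independent: {var_indep:.1f}  CRN: {var_crn:.1f}")
```

In this toy model the CRN variance comes out well below the independent one. A result in which `var_crn` matched or exceeded `var_indep` on the paper's own tasks would be the kind of evidence that counts against the claim.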
Original abstract
Simulation-based planning with rollouts is a widely-deployed technique for decision making in stochastic environments. The primary instrument of simulation-based planning is a sampling model, which is repeatedly called to generate trajectories and estimate the utilities of available actions. Among the actions thus explored, one with the maximum estimated utility is then executed. In this paper, we examine the effect of using common random numbers in the simulation process. We obtain a simple recipe for (provably) reducing variance in relative utility when simulations invoke a rollout policy beyond some depth. Experiments on synthetic tasks confirm that our scheme improves task performance. The broader significance of our innovation is apparent from two practical applications: (1) single-step lookahead planning in a pension-disbursement task, and (2) a deployment of the well-known UCT algorithm for the game of Ludo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes applying common random numbers (CRN) to the post-depth rollout phase of sampling models in simulation-based planning. It claims this yields a simple, provable reduction in the variance of relative action utilities (without biasing comparisons), leading to better action selection. The claim is supported by experiments on synthetic tasks showing improved performance, plus two applications: single-step lookahead in a pension-disbursement task and UCT for Ludo.
Significance. If the variance-reduction claim holds under the stated conditions, the technique offers a low-cost way to improve sample efficiency in rollout-based planners such as UCT/MCTS variants. The two practical applications provide concrete evidence of utility beyond synthetic benchmarks. The work correctly identifies and exploits the controllable randomness already present in many simulators.
minor comments (4)
- [§3] §3 (or wherever the variance argument appears): the derivation would benefit from an explicit side-by-side comparison of Var(U_i - U_j) under independent sampling versus CRN, including the covariance term, to make the reduction factor transparent.
- [Experiments] Experiments section: report the number of independent trials, standard errors, and any statistical tests for the performance gains on synthetic tasks and Ludo; without these the magnitude of improvement is hard to judge.
- [Notation/§2] Notation: introduce the rollout-depth parameter d explicitly when first used and keep its symbol consistent throughout.
- [Figures] Figure captions: ensure each figure states the number of simulations per action and whether error bars represent standard error or deviation.
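The side-by-side comparison requested in the first comment can be written out with the standard variance identity. In this sketch, \hat U_i and \hat U_j denote the utility estimates for two actions, and the superscripts "ind" and "crn" are labels introduced here for illustration, not the paper's notation.

```latex
\begin{align*}
\operatorname{Var}\bigl(\hat U_i - \hat U_j\bigr)
  &= \operatorname{Var}(\hat U_i) + \operatorname{Var}(\hat U_j)
     - 2\operatorname{Cov}(\hat U_i, \hat U_j), \\
\text{independent sampling:}\quad
\operatorname{Var}^{\mathrm{ind}}\bigl(\hat U_i - \hat U_j\bigr)
  &= \operatorname{Var}(\hat U_i) + \operatorname{Var}(\hat U_j), \\
\text{CRN:}\quad
\operatorname{Var}^{\mathrm{crn}}\bigl(\hat U_i - \hat U_j\bigr)
  &= \operatorname{Var}^{\mathrm{ind}}\bigl(\hat U_i - \hat U_j\bigr)
     - 2\operatorname{Cov}^{\mathrm{crn}}(\hat U_i, \hat U_j).
\end{align*}
```

CRN leaves the marginal distribution of each estimate unchanged, so the individual means and variances are the same under both schemes. The variance of the difference therefore falls exactly when the covariance induced by the shared randomness is positive.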
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately captures the core contribution regarding common random numbers for variance reduction in post-depth rollouts.
Circularity Check
No significant circularity; standard CRN variance reduction applied to rollouts
full rationale
The paper's central claim is a direct application of the well-known common random numbers (CRN) technique to reduce the variance of relative utilities in post-depth rollouts. It follows from a standard property: when shared randomness induces non-negative covariance between two estimators, their difference has variance no higher than under independent sampling, and no bias is introduced, provided the simulator exposes controllable randomness. No equations reduce to self-definition, no fitted parameters are relabeled as predictions, and no self-citation chain or uniqueness theorem is invoked to force the result. The derivation is self-contained and rests only on external simulation benchmarks and probabilistic facts about CRN.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The environment admits a sampling model that can generate trajectories under shared randomness.
Lean theorems connected to this paper
- Statistics/MDP variance analysis: orthogonal to Cost.FunctionalEquation and the Foundation forcing chain; none applicable.
  Relation between the paper passage and the cited Recognition theorem: unclear.
Theorem 2: var(X_DD) ≤ var(X_I) ... cov(V^{π1}_{M1}(s,t), V^{π2}_{M3}(s,t)) ≥ 0 by induction.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.