pith. sign in

arxiv: 2605.04732 · v1 · submitted 2026-05-06 · 💻 cs.LG

Using Common Random Numbers for Simulation-based Planning with Rollouts

Pith reviewed 2026-05-08 18:24 UTC · model grok-4.3

classification 💻 cs.LG
keywords common random numbersrollout planningvariance reductionsimulation-based planningstochastic environmentsUCTMonte Carlo planning
0
0 comments X

The pith

Using common random numbers in rollout simulations provably reduces variance in relative utility estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how sharing the same random numbers when simulating different actions in rollout-based planning lowers the variance of their estimated utility differences. This matters because more stable relative comparisons let the planner pick better actions in stochastic settings without extra samples or computation. The reduction is provable once simulations switch to a rollout policy after an initial depth. Synthetic experiments and two applications, single-step planning for pension disbursement and UCT in Ludo, show the scheme improves final task performance.

Core claim

When the sampling model invokes a rollout policy beyond some depth, applying common random numbers across the trajectory generations for different actions yields a strictly lower variance for the relative utility estimates, improving action selection in the planning loop.

What carries the argument

Common random numbers applied to the sampling model so that trajectories for competing actions share the same randomness and produce correlated utility estimates.

Load-bearing premise

The sampling model must generate trajectories whose relative utilities can be compared under shared randomness without introducing bias.

What would settle it

An experiment that measures the variance of the difference between two action utilities and finds no reduction (or an increase) when common random numbers replace independent draws in the rollout phase.

Figures

Figures reproduced from arXiv: 2605.04732 by Frederic J Maliakkal, Harshad Khadilkar, Sandarbh Yadav, Shivaram Kalyanakrishnan.

Figure 1
Figure 1. Figure 1: Figure (a) shows the MDP defined in the proof of Proposition view at source ↗
Figure 2
Figure 2. Figure 2: Performance metrics against the number of simulations on synthetic tasks. Results (here view at source ↗
Figure 3
Figure 3. Figure 3: Figure (a) explains the sequence of steps in the FTVAF task, while Figure (b) records the view at source ↗
Figure 4
Figure 4. Figure 4: Ludo: Figure (a) shows the board, and Figure (b) the performance of simulation-based view at source ↗
read the original abstract

Simulation-based planning with rollouts is a widely-deployed technique for decision making in stochastic environments. The primary instrument of simulation-based planning is a sampling model, which is repeatedly called to generate trajectories and estimate the utilities of available actions. Among the actions thus explored, one with the maximum estimated utility is then executed. In this paper, we examine the effect of using common random numbers in the simulation process. We obtain a simple recipe for (provably) reducing variance in relative utility when simulations invoke a rollout policy beyond some depth. Experiments on synthetic tasks confirm that our scheme improves task performance. The broader significance of our innovation is apparent from two practical applications: (1) single-step lookahead planning in a pension-disbursement task, and (2) a deployment of the well-known UCT algorithm for the game of Ludo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript proposes applying common random numbers (CRN) to the post-depth rollout phase of sampling models in simulation-based planning. It claims this yields a simple, provable reduction in the variance of relative action utilities (without biasing comparisons), leading to better action selection. The claim is supported by experiments on synthetic tasks showing improved performance, plus two applications: single-step lookahead in a pension-disbursement task and UCT for Ludo.

Significance. If the variance-reduction claim holds under the stated conditions, the technique offers a low-cost way to improve sample efficiency in rollout-based planners such as UCT/MCTS variants. The two practical applications provide concrete evidence of utility beyond synthetic benchmarks. The work correctly identifies and exploits the controllable randomness already present in many simulators.

minor comments (4)
  1. [§3] §3 (or wherever the variance argument appears): the derivation would benefit from an explicit side-by-side comparison of Var(U_i - U_j) under independent sampling versus CRN, including the covariance term, to make the reduction factor transparent.
  2. [Experiments] Experiments section: report the number of independent trials, standard errors, and any statistical tests for the performance gains on synthetic tasks and Ludo; without these the magnitude of improvement is hard to judge.
  3. [Notation/§2] Notation: introduce the rollout-depth parameter d explicitly when first used and keep its symbol consistent throughout.
  4. [Figures] Figure captions: ensure each figure states the number of simulations per action and whether error bars represent standard error or deviation.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately captures the core contribution regarding common random numbers for variance reduction in post-depth rollouts.

Circularity Check

0 steps flagged

No significant circularity; standard CRN variance reduction applied to rollouts

full rationale

The paper's central claim is a direct application of the well-known common random numbers (CRN) technique to reduce variance of relative utilities in post-depth rollouts. This follows from the standard property that shared randomness makes the difference of two estimators have lower variance than independent sampling, without bias, provided the simulator exposes controllable randomness. No equations reduce to self-definition, no fitted parameters are relabeled as predictions, and no self-citation chain or uniqueness theorem is invoked to force the result. The derivation is self-contained against external simulation benchmarks and probabilistic facts about CRN.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly relies on standard assumptions of stochastic simulation models.

axioms (1)
  • domain assumption The environment admits a sampling model that can generate trajectories under shared randomness.
    Required for any simulation-based planning method described.

pith-pipeline@v0.9.0 · 5447 in / 1111 out tokens · 28150 ms · 2026-05-08T18:24:25.733346+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

125 extracted references · 125 canonical work pages

  1. [1]

    2015 , publisher=

    Goals-based wealth management: An integrated and practical approach to changing the structure of wealth advisory practices , author=. 2015 , publisher=

  2. [2]

    International conference on machine learning , pages=

    Asynchronous methods for deep reinforcement learning , author=. International conference on machine learning , pages=. 2016 , organization=

  3. [3]

    Journal of political economy , volume=

    The pricing of options and corporate liabilities , author=. Journal of political economy , volume=. 1973 , publisher=

  4. [4]

    1992 , publisher=

    Aalen, Odd O , journal=. 1992 , publisher=

  5. [5]

    2006 , address =

    Elizabeth Arias , title =. 2006 , address =

  6. [6]

    Journal of portfolio management , volume=

    The sharpe ratio , author=. Journal of portfolio management , volume=. 1994 , publisher=

  7. [7]

    Nature , volume=

    Human-level control through deep reinforcement learning , author=. Nature , volume=. 2015 , publisher=

  8. [8]

    1997 , publisher=

    Monte Carlo Simulation , author=. 1997 , publisher=

  9. [9]

    Advances in neural information processing systems , volume=

    Actor-critic algorithms , author=. Advances in neural information processing systems , volume=

  10. [10]

    Journal of financial and quantitative analysis , volume=

    An analytic derivation of the efficient portfolio frontier , author=. Journal of financial and quantitative analysis , volume=. 1972 , publisher=

  11. [11]

    Computational Management Science , volume=

    Dynamic portfolio allocation in goals-based wealth management , author=. Computational Management Science , volume=. 2020 , publisher=

  12. [12]

    2012 , publisher=

    Dynamic programming and optimal control: Volume I , author=. 2012 , publisher=

  13. [13]

    2007 , publisher=

    Hidden Markov Models in Finance , author=. 2007 , publisher=

  14. [14]

    2011 , publisher=

    B. 2011 , publisher=

  15. [15]

    Sutton, Richard S and Barto, Andrew G , year=

  16. [16]

    2024 , organization=

    Das, Sanjiv R and Ostrov, Daniel and Mittal, Sukrit and Radhakrishnan, Anand and Srivastav, Deep Ratna and Wang, Hungjen , booktitle=. 2024 , organization=

  17. [17]

    2011 , publisher=

    Bacinello, Anna Rita and Millossovich, Pietro and Olivieri, Annamaria and Pitacco, Ermanno , journal=. 2011 , publisher=

  18. [18]

    The Review of Economics and Statistics , urldate =

    Lifetime Portfolio Selection By Dynamic Stochastic Programming , author =. The Review of Economics and Statistics , urldate =. 1969 , pages =

  19. [19]

    2018 , title =

    Das, Sanjiv Ranjan and Ostrov, Daniel N and Radhakrishnan, Anand and Srivastav, Deep , journal =. 2018 , title =. doi:10.2139/ssrn.3117765 , url =

  20. [20]

    Journal of Banking & Finance , author =

    Dynamic optimization for multi-goals wealth management , author =. Journal of Banking; Finance , publisher =. 2022 , month =. doi:10.1016/j.jbankfin.2021.106192 , url =

  21. [21]

    Jucker and Jorge Alberto Garcia Gomez , volume =

    James V. Jucker and Jorge Alberto Garcia Gomez , volume =. 1975 , pages =

  22. [22]

    Scientific Reports , publisher =

    Quantifying the randomness of the stock markets , author =. Scientific Reports , publisher =. 2019 , month =. doi:10.1038/s41598-019-49320-9 , url =

  23. [23]

    and Fleet, David J

    Wang, Jack M. and Fleet, David J. and Hertzmann, Aaron , number =. ACM Transactions on Graphics , publisher =. 2010 , title =

  24. [24]

    , publisher =

    Spall, James C. , publisher =. 2003 , title =

  25. [25]

    Ng, Andrew and Jordan, Michael , booktitle=

  26. [26]

    Annual Review of Statistics and Its Application , volume=

    A review of reinforcement learning in financial applications , author=. Annual Review of Statistics and Its Application , volume=. 2025 , publisher=

  27. [27]

    Available at SSRN 5289956 , year=

    A Pre-trained Reinforcement Learning Approach to Goals-Based Wealth Management , author=. Available at SSRN 5289956 , year=

  28. [28]

    Expert systems with applications , volume=

    Decision-making for financial trading: A fusion approach of machine learning and portfolio selection , author=. Expert systems with applications , volume=. 2019 , publisher=

  29. [29]

    arXiv preprint arXiv:1301.7380 , year=

    Solving POMDPs by searching in policy space , author=. arXiv preprint arXiv:1301.7380 , year=

  30. [30]

    Blackmore and B

    L. Blackmore and B. Williams , journal =. 2007 , title =

  31. [31]

    2003 , title =

    Andrew Ng and H-jin Kim and Michael Jordan and Shankar Sastry , journal =. 2003 , title =

  32. [32]

    Yao , volume =

    Paul Glasserman and David D. Yao , volume =. 1992 , pages =

  33. [33]

    2017 , title =

    Tim Salimans and Jonathan Ho and Xi Chen and Ilya Sutskever , volume =. 2017 , title =

  34. [34]

    2021 , pages =

    Recent advances in reinforcement learning in finance , author =. 2021 , pages =

  35. [35]

    2015 , title =

    Phelim, Boyle and Mary, Hardy and Anne, MacKay and David, Saunders , journal =. 2015 , title =

  36. [36]

    2003 , title =

    Ng, Andrew , journal =. 2003 , title =

  37. [37]

    2016 , title =

    Volodymyr Mnih and Adri. 2016 , title =

  38. [38]

    2021 , pages =

    Stable-Baselines3: Reliable Reinforcement Learning Implementations , author =. 2021 , pages =

  39. [39]

    1952 , pages =

    PORTFOLIO SELECTION , author =. 1952 , pages =. doi:https://doi.org/10.1111/j.1540-6261.1952.tb01525.x , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1540-6261.1952.tb01525.x , number =

  40. [40]

    and Saffell, M

    Moody, J. and Saffell, M. , journal=. Learning to trade via direct reinforcement , year=

  41. [41]

    Sharing Longevity Risk: Why Governments Should Issue Longevity Bonds , volume =

    Blake, David and Boardman, Tom and Cairns, Andrew , year =. Sharing Longevity Risk: Why Governments Should Issue Longevity Bonds , volume =. North American Actuarial Journal , doi =

  42. [42]

    Reinforcement learning for optimized trade execution , volume =

    Nevmyvaka, Yuriy and Feng, Yi and Kearns, Michael , year =. Reinforcement learning for optimized trade execution , volume =. ICML 2006 - Proceedings of the 23rd International Conference on Machine Learning , doi =

  43. [43]

    2000 , journal =

    Optimal execution of portfolio trans-actions , author=. 2000 , journal =

  44. [44]

    2017 , eprint=

    A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem , author=. 2017 , eprint=

  45. [45]

    2019 , eprint=

    QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds , author=. 2019 , eprint=

  46. [46]

    2022 , eprint=

    Deep Hedging: Continuous Reinforcement Learning for Hedging of General Portfolios across Multiple Risk Aversions , author=. 2022 , eprint=

  47. [47]

    2011 , organization=

    Alvi, Faisal and Ahmed, Moataz , booktitle=. 2011 , organization=

  48. [48]

    2012 IEEE Conference on Computational Intelligence and Games (CIG) , pages=

    TD ( ) and Q-learning based Ludo players , author=. 2012 IEEE Conference on Computational Intelligence and Games (CIG) , pages=. 2012 , organization=

  49. [49]

    2023 7th IEEE Congress on Information Science and Technology (CiSt) , pages=

    Incorporating Feature Penalty in Reinforcement Learning for Ludo Game , author=. 2023 7th IEEE Congress on Information Science and Technology (CiSt) , pages=. 2023 , organization=

  50. [50]

    Vittori, Edoardo and Likmeta, Amarildo and Restelli, Marcello , booktitle=

  51. [51]

    2011 IEEE Conference on Computational Intelligence and Games (CIG'11) , pages=

    Monte-Carlo tree search for the game of Scotland Yard , author=. 2011 IEEE Conference on Computational Intelligence and Games (CIG'11) , pages=. 2011 , organization=

  52. [52]

    Journal of Computational Finance , year=

    Hedging of financial derivative contracts via Monte Carlo tree search , author=. Journal of Computational Finance , year=

  53. [53]

    FinPlan 2023 , pages=

    FinRDDL: Can AI planning be used for quantitative finance problems? , author=. FinPlan 2023 , pages=

  54. [54]

    2016 , publisher=

    Silver, David and Huang, Aja and Maddison, Chris J and Guez, Arthur and Sifre, Laurent and Van Den Driessche, George and Schrittwieser, Julian and Antonoglou, Ioannis and Panneershelvam, Veda and Lanctot, Marc and others , journal=. 2016 , publisher=

  55. [55]

    2017 , publisher=

    Silver, David and Schrittwieser, Julian and Simonyan, Karen and Antonoglou, Ioannis and Huang, Aja and Guez, Arthur and Hubert, Thomas and Baker, Lucas and Lai, Matthew and Bolton, Adrian and others , journal=. 2017 , publisher=

  56. [56]

    2018 , publisher=

    Silver, David and Hubert, Thomas and Schrittwieser, Julian and Antonoglou, Ioannis and Lai, Matthew and Guez, Arthur and Lanctot, Marc and Sifre, Laurent and Kumaran, Dharshan and Graepel, Thore and others , journal=. 2018 , publisher=

  57. [57]

    Foundations and Trends in Machine Learning , volume=

    Model-based reinforcement learning: A survey , author=. Foundations and Trends in Machine Learning , volume=. 2023 , publisher=

  58. [58]

    Nature , volume=

    Discovering faster matrix multiplication algorithms with reinforcement learning , author=. Nature , volume=. 2022 , publisher=

  59. [59]

    Nature , volume=

    Faster sorting algorithms discovered using deep reinforcement learning , author=. Nature , volume=. 2023 , publisher=

  60. [60]

    2022 , publisher=

    Dam, Tuan and Chalvatzaki, Georgia and Peters, Jan and Pajarinen, Joni , journal=. 2022 , publisher=

  61. [61]

    2020 IEEE 16th International Conference on Automation Science and Engineering (CASE) , pages=

    Energy-aware multi-goal motion planning guided by monte carlo search , author=. 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE) , pages=. 2020 , organization=

  62. [62]

    Sorensen , title =

    Simon L.B. Sorensen , title =. 2023 , howpublished =

  63. [63]

    2023 , organization=

    Sinclair, Sean R and Frujeri, Felipe Vieira and Cheng, Ching-An and Marshall, Luke and Barbalho, Hugo De Oliveira and Li, Jingling and Neville, Jennifer and Menache, Ishai and Swaminathan, Adith , booktitle=. 2023 , organization=

  64. [64]

    Mao, Hongzi and Venkatakrishnan, Shaileshh Bojja and Schwarzkopf, Malte and Alizadeh, Mohammad , journal=

  65. [65]

    2000 , organization=

    Chong, Edwin KP and Givan, Robert L and Chang, Hyeong Soo , booktitle=. 2000 , organization=

  66. [66]

    2015 , publisher=

    Decision making under uncertainty: theory and application , author=. 2015 , publisher=

  67. [67]

    2022 , organization=

    Efroni, Yonathan and Foster, Dylan J and Misra, Dipendra and Krishnamurthy, Akshay and Langford, John , booktitle=. 2022 , organization=

  68. [68]

    International Conference on Machine Learning , pages=

    Discovering and removing exogenous state variables and rewards for reinforcement learning , author=. International Conference on Machine Learning , pages=. 2018 , organization=

  69. [69]

    2008 , publisher=

    Stout, Natasha K and Goldie, Sue J , journal=. 2008 , publisher=

  70. [70]

    Journal of Machine Learning Research , volume=

    Variance reduction techniques for gradient estimates in reinforcement learning , author=. Journal of Machine Learning Research , volume=

  71. [71]

    Computer-Aided Design , volume=

    Using Monte-Carlo variance reduction in statistical tolerance synthesis , author=. Computer-Aided Design , volume=. 1997 , publisher=

  72. [72]

    1956 , organization=

    Hammersley, John Michael and Morton, Keith William , booktitle=. 1956 , organization=

  73. [73]

    International Journal of Reliability and Safety , volume=

    Separable Monte Carlo combined with importance sampling for variance reduction , author=. International Journal of Reliability and Safety , volume=. 2013 , publisher=

  74. [74]

    2007 , publisher=

    Variance reduction three approaches to control variates , author=. 2007 , publisher=

  75. [75]

    2002 , organization=

    Glynn, Peter W and Szechtman, Roberto , booktitle=. 2002 , organization=

  76. [76]

    Progress in Nuclear Energy , volume=

    Monte Carlo variance reduction with deterministic importance functions , author=. Progress in Nuclear Energy , volume=. 2003 , publisher=

  77. [77]

    Science and Technology of Engineering, Chemistry and Environmental Protection , volume=

    Variance Reduction in Monte Carlo Option Pricing: A Comparative Analysis of Control Variates, Multiple Control Variates and Antithetic Variates , author=. Science and Technology of Engineering, Chemistry and Environmental Protection , volume=

  78. [78]

    IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing , volume=

    A study of stratified sampling in variance reduction techniques for parametric yield estimation , author=. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing , volume=. 2002 , publisher=

  79. [79]

    Journal of Machine Learning Research , volume=

    Monte carlo gradient estimation in machine learning , author=. Journal of Machine Learning Research , volume=

  80. [80]

    1981 , publisher=

    Lavenberg, Stephen S and Welch, Peter D , journal=. 1981 , publisher=

Showing first 80 references.