Adaptive Learning via Off-Model Training and Importance Sampling for Fully Non-Markovian Optimal Stochastic Control. Complete version
Pith reviewed 2026-05-10 14:17 UTC · model grok-4.3
The pith
A single fixed dataset of trajectories can recover optimal controls for non-Markovian systems with unknown parameters through importance sampling reweighting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We construct explicit dominating training laws and Radon-Nikodym weights for representative classes of non-Markovian controlled systems. This yields an off-model training architecture in which a fixed synthetic dataset is generated under a reference law, while the dynamic programming operators associated with a target model are recovered by importance sampling. For fixed parameters, non-asymptotic error bounds are established for deep neural network approximation of the embedded dynamic programming equation; for adaptive learning, quantitative estimates separate Monte Carlo approximation error from model-risk error.
What carries the argument
The dominating training law together with its Radon-Nikodym weight, which performs importance sampling to map the reference measure to the target measure inside the embedded backward dynamic programming recursion.
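In generic notation, the reweighting can be sketched as a change of measure. The symbols here are illustrative assumptions, not the paper's own: $\mathbb{P}^{\mathrm{ref}}$ is the reference (training) law, $\mathbb{P}^{\theta}$ is the target law with $\mathbb{P}^{\theta} \ll \mathbb{P}^{\mathrm{ref}}$, $(\mathcal{F}_k)$ is the skeleton filtration, and $V_{k+1}$ is an integrable $\mathcal{F}_{k+1}$-measurable value.

```latex
% Illustrative notation only; the symbols are assumptions, not taken from the paper.
W^{\theta} := \frac{d\mathbb{P}^{\theta}}{d\mathbb{P}^{\mathrm{ref}}},
\qquad
W^{\theta}_k := \mathbb{E}^{\mathrm{ref}}\!\left[ W^{\theta} \mid \mathcal{F}_k \right],
\\[4pt]
\mathbb{E}^{\theta}\!\left[ V_{k+1} \mid \mathcal{F}_k \right]
= \mathbb{E}^{\mathrm{ref}}\!\left[ \frac{W^{\theta}_{k+1}}{W^{\theta}_k}\, V_{k+1} \,\Big|\, \mathcal{F}_k \right]
\quad \text{on } \{ W^{\theta}_k > 0 \}.
```

This is the standard conditional Bayes rule. The one-step ratio $W^{\theta}_{k+1} / W^{\theta}_k$ is the factor that would multiply a sample drawn under the reference law inside each backward step.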
If this is right
- Non-asymptotic error bounds hold for deep neural network approximation of the embedded dynamic programming equation when parameters are fixed.
- Quantitative estimates separate Monte Carlo sampling error from model-risk error when the control law is adapted to changing parameters.
- Recalibration to new parameter values requires only reweighting of the existing training sample.
- The method applies directly to path-dependent stochastic differential equations, rough-volatility hedging, and systems driven by fractional Brownian motion.
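The reweighting idea in the list above can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the paper's construction: the reference law has independent Gaussian increments N(0, dt), the target law adds a constant drift `theta`, and the log-weight is the resulting Girsanov-type likelihood ratio. All function names are hypothetical.

```python
import math
import random


def simulate_reference_paths(n_paths, n_steps, dt, seed=0):
    """Draw a fixed dataset of increments under the reference law N(0, dt)."""
    rng = random.Random(seed)
    sd = math.sqrt(dt)
    return [[rng.gauss(0.0, sd) for _ in range(n_steps)] for _ in range(n_paths)]


def log_weight(increments, theta, dt):
    """Log-likelihood ratio of N(theta*dt, dt) against N(0, dt), summed over steps."""
    return sum(theta * z - 0.5 * theta * theta * dt for z in increments)


def reweighted_mean(paths, theta, dt, payoff):
    """Self-normalized importance-sampling estimate of E_theta[payoff(path)]."""
    log_w = [log_weight(p, theta, dt) for p in paths]
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return sum(wi * payoff(p) for wi, p in zip(w, paths)) / total


def terminal_value(increments):
    """Terminal state X_T, the sum of the increments (toy payoff)."""
    return sum(increments)
```

Usage: generate `paths` once, then call `reweighted_mean(paths, theta, dt, terminal_value)` for each new value of `theta`. The dataset is never regenerated; only the weights change. In this toy model the target mean of the terminal value is `theta * T`, which gives a simple sanity check.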
Where Pith is reading between the lines
- The same off-model structure could reduce simulation cost in any path-dependent control setting where a dominating law with moderate variance can be exhibited.
- Error separation offers a practical way to decide how much computational budget to allocate to additional sampling versus parameter estimation.
- If dominating laws exist for wider classes of non-Markovian drivers, the architecture would extend beyond the linear-quadratic examples shown.
Load-bearing premise
Explicit dominating training laws and associated Radon-Nikodym weights with controlled variance can be constructed for the representative classes of fully non-Markovian controlled systems.
What would settle it
A concrete non-Markovian controlled system from one of the paper's representative classes in which every candidate dominating law produces importance-sampling weights whose variance grows without bound as the time horizon or discretization level increases.
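A toy Gaussian calculation shows the kind of growth at issue. It is an assumed example, not one of the paper's constructions: take $n$ independent increments over the horizon $T = n\,\Delta t$, write $X_T$ for their sum, and let $W_n$ be the likelihood ratio of the target law against the reference law.

```latex
% Toy Gaussian calculation; the setup is an assumption for illustration.
% Case 1: drift shift. Target N(\theta \Delta t, \Delta t), reference N(0, \Delta t) per step.
W_n = \exp\!\Big( \theta X_T - \tfrac{1}{2}\theta^2 T \Big),
\qquad
\mathbb{E}^{\mathrm{ref}}\!\left[ W_n^2 \right] = e^{\theta^2 T}.
\\[6pt]
% Case 2: variance mismatch. Target N(0, \sigma^2 \Delta t), reference N(0, s^2 \Delta t), with 2 s^2 > \sigma^2.
\mathbb{E}^{\mathrm{ref}}\!\left[ W_n^2 \right]
= \left( \frac{s^2}{\sigma \sqrt{2 s^2 - \sigma^2}} \right)^{\! n}.
```

In case 1 the second moment depends only on the horizon $T$, not on $n$. In case 2 the base exceeds 1 whenever $s \neq \sigma$, so the second moment grows geometrically with the discretization level. A counterexample in the sense described above would need this kind of growth for every candidate dominating law, which the toy calculation does not establish.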
Original abstract
This paper studies continuous-time stochastic control problems whose controlled states are fully non-Markovian and depend on unknown model parameters. Such problems arise naturally in path-dependent stochastic differential equations, rough-volatility hedging, and systems driven by fractional Brownian motion. Building on the discrete skeleton approach developed in earlier work, we propose a Monte Carlo learning methodology for the associated embedded backward dynamic programming equation. Our main contribution is twofold. First, we construct explicit dominating training laws and Radon-Nikodym weights for several representative classes of non-Markovian controlled systems. This yields an off-model training architecture in which a fixed synthetic dataset is generated under a reference law, while the dynamic programming operators associated with a target model are recovered by importance sampling. Second, we use this structure to design an adaptive update mechanism under parametric model uncertainty, so that repeated recalibration can be performed by reweighting the same training sample rather than regenerating new trajectories. For fixed parameters, we establish non-asymptotic error bounds for the approximation of the embedded dynamic programming equation via deep neural networks. For adaptive learning, we derive quantitative estimates that separate Monte Carlo approximation error from model-risk error. Numerical experiments illustrate both the off-model training mechanism and the adaptive importance-sampling update in structured linear-quadratic examples.
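One backward step of an importance-weighted regression can be sketched as follows. This is a hedged illustration only: a one-dimensional linear model stands in for the deep neural network used in the paper, `xs` are hypothetical state features, `ys` are hypothetical continuation-value targets computed on reference-law samples, and `weights` are the importance weights that tilt the fit toward the target model.

```python
def weighted_linear_fit(xs, ys, weights):
    """Return (a, b) minimizing sum_i w_i * (y_i - a - b * x_i)**2.

    A linear stand-in for the regression step; the paper uses deep
    neural networks, which this sketch does not attempt to reproduce.
    """
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have positive total mass")
    # Weighted means of features and targets.
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    # Weighted second moments around the means.
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    s_xy = sum(w * (x - x_bar) * (y - y_bar)
               for w, x, y in zip(weights, xs, ys))
    if s_xx == 0.0:
        raise ValueError("degenerate design: weighted x values coincide")
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b
```

Samples with zero weight drop out of the fit entirely, which is how a sample that is implausible under the target model stops influencing the regression without being removed from the dataset.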
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops an off-model Monte Carlo learning method for continuous-time stochastic control problems with fully non-Markovian controlled states and unknown parameters. It constructs explicit dominating training laws and square-integrable Radon-Nikodym weights for representative classes (path-dependent SDEs, rough-volatility models, fBM-driven systems) via discrete skeleton embedding. This permits generation of a single fixed synthetic dataset under a reference measure, with target-model dynamic programming operators recovered by importance sampling. Non-asymptotic error bounds are derived for deep neural network approximation of the embedded backward equation under fixed parameters, together with quantitative estimates that separate Monte Carlo sampling error from model-risk error under adaptive parametric updates. The approach is illustrated on linear-quadratic examples.
Significance. If the explicit constructions and variance bounds hold, the work supplies a practical mechanism for adaptive recalibration without trajectory regeneration, while rigorously separating approximation and model-risk contributions. The provision of concrete dominating measures and Radon-Nikodym weights for several non-Markovian classes, together with the non-asymptotic DNN bounds and error-separation estimates, constitutes a concrete advance over generic importance-sampling arguments. These features support reproducible numerical implementation and falsifiable error predictions in structured settings.
minor comments (3)
- [Abstract] The statement of non-asymptotic bounds does not indicate the dependence on network width, depth, or time horizon; adding a brief qualitative indication would improve readability without altering the technical content.
- [§1, Introduction] The reference to the 'discrete skeleton approach developed in earlier work' would benefit from an explicit citation or pointer to the relevant prior result to allow readers to locate the embedding construction.
- [Numerical experiments] While the LQ examples illustrate the mechanism, quantitative reporting of the realized variance of the importance weights (or comparison to the derived bounds) would strengthen the empirical support for the controlled-variance claim.
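The kind of weight diagnostic asked for in the last comment could be reported with standard quantities such as the effective sample size and the squared coefficient of variation of the weights. The sketch below is a generic illustration with hypothetical function names; it is not taken from the paper's experiments.

```python
import math


def normalized_weights(log_weights):
    """Exponentiate log-weights stably and normalize them to sum to 1."""
    shift = max(log_weights)
    w = [math.exp(lw - shift) for lw in log_weights]
    total = sum(w)
    return [wi / total for wi in w]


def effective_sample_size(log_weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    w = normalized_weights(log_weights)
    return 1.0 / sum(wi * wi for wi in w)


def weight_cv_squared(log_weights):
    """Empirical squared coefficient of variation of the weights, n / ESS - 1."""
    n = len(log_weights)
    return n / effective_sample_size(log_weights) - 1.0
```

An effective sample size close to the number of trajectories indicates nearly uniform weights, while a value near 1 indicates that a single trajectory dominates the estimate.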
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were listed in the report, so we have no individual points to address. We remain available to incorporate any editorial suggestions or minor clarifications in a revised version.
Circularity Check
Minor self-citation to prior discrete skeleton method; central claims remain independent
full rationale
The derivation relies on explicit constructions of dominating training laws and Radon-Nikodym weights for path-dependent SDEs, rough-volatility, and fBM-driven systems, which are supplied in the paper via the discrete skeleton embedding rather than assumed or fitted. Non-asymptotic DNN approximation bounds and the separation of Monte Carlo versus model-risk errors in the adaptive setting follow directly from these weights and standard concentration arguments, without reducing to self-referential definitions or re-using fitted quantities as predictions. The single reference to earlier discrete skeleton work is not load-bearing for the new off-model architecture or error estimates, which are self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: existence of explicit dominating training laws and Radon-Nikodym weights for representative classes of fully non-Markovian controlled systems.
Author affiliations
Departamento de Matemática Aplicada e Estatística, Universidade de São Paulo, 13560-970, São Carlos - SP, Brazil. Email address: leao@estatcamp.com.br
Departamento de Matemática, Universidade de Brasília, 13560-970, ...