arxiv: 2604.13147 · v1 · submitted 2026-04-14 · 📊 stat.ML · cs.LG · math.PR

Adaptive Learning via Off-Model Training and Importance Sampling for Fully Non-Markovian Optimal Stochastic Control. Complete version

Pith reviewed 2026-05-10 14:17 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · math.PR
keywords stochastic control · non-Markovian processes · importance sampling · deep neural networks · adaptive learning · Monte Carlo methods · dynamic programming · model uncertainty

The pith

A single fixed dataset of trajectories can recover optimal controls for non-Markovian systems with unknown parameters through importance sampling reweighting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a Monte Carlo method for solving continuous-time optimal control problems whose state dynamics are fully non-Markovian and depend on uncertain parameters. It constructs explicit reference probability laws under which a fixed collection of trajectories is simulated once; dynamic programming operators for any target model are then recovered by reweighting those trajectories with Radon-Nikodym factors. This off-model architecture supports both static approximation of the value function by deep neural networks and an adaptive scheme that updates the control law under parametric uncertainty by adjusting weights rather than regenerating paths. Non-asymptotic error bounds separate Monte Carlo sampling error from model-risk error, and the approach is illustrated on linear-quadratic examples with path-dependent features.

Core claim

We construct explicit dominating training laws and Radon-Nikodym weights for representative classes of non-Markovian controlled systems. This yields an off-model training architecture in which a fixed synthetic dataset is generated under a reference law, while the dynamic programming operators associated with a target model are recovered by importance sampling. For fixed parameters, non-asymptotic error bounds are established for deep neural network approximation of the embedded dynamic programming equation; for adaptive learning, quantitative estimates separate Monte Carlo approximation error from model-risk error.
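The reweighting idea can be sketched in a deliberately simple setting that is not taken from the paper. The sketch assumes the reference law Q is a driftless Brownian motion on a time grid and the target law P_theta adds a constant drift theta, so the Radon-Nikodym weight has the familiar Girsanov form. The function names (`simulate_reference`, `log_weight`, `target_expectation`) and all numerical values are hypothetical.

```python
import numpy as np

def simulate_reference(n_paths, n_steps, dt, rng):
    """Brownian increments under the assumed reference law Q (no drift)."""
    return rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))

def log_weight(increments, theta, dt):
    """Log of dP_theta/dQ for a constant drift theta (toy assumption).

    Under P_theta each increment is N(theta*dt, dt); under Q it is N(0, dt),
    so the per-step log ratio is theta*dW - 0.5*theta**2*dt.
    """
    n_steps = increments.shape[1]
    return theta * increments.sum(axis=1) - 0.5 * theta**2 * dt * n_steps

def target_expectation(payoff, increments, theta, dt):
    """Estimate E_theta[payoff(X_T)] using only paths simulated under Q."""
    w = np.exp(log_weight(increments, theta, dt))
    x_T = increments.sum(axis=1)
    return np.mean(w * payoff(x_T))

rng = np.random.default_rng(0)
n_paths, n_steps, dt = 100_000, 50, 0.02      # horizon T = 1
dW = simulate_reference(n_paths, n_steps, dt, rng)

theta = 0.5
estimate = target_expectation(lambda x: x**2, dW, theta, dt)
T = n_steps * dt
exact = T + (theta * T) ** 2                   # variance plus squared mean under P_theta
print(estimate, exact)
```

In this toy case the weight is available in closed form; the paper's contribution concerns constructing such weights for genuinely non-Markovian classes, which this sketch does not attempt.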

What carries the argument

The dominating training law together with its Radon-Nikodym weight, which performs importance sampling to map the reference measure to the target measure inside the embedded backward dynamic programming recursion.
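One way to picture importance sampling inside a single backward step is a weighted least-squares regression: data are simulated under the reference law, and the likelihood-ratio weights make the fitted function approximate the conditional expectation under the target law. The sketch below assumes a one-step Gaussian toy model and a quadratic polynomial basis as a stand-in for the deep neural networks used in the paper; `weighted_regression` and all parameter values are hypothetical.

```python
import numpy as np

def weighted_regression(features, targets, weights):
    """Solve min_beta sum_i w_i * (y_i - phi_i @ beta)**2 by sqrt-weight rescaling."""
    s = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(features * s[:, None], targets * s, rcond=None)
    return beta

rng = np.random.default_rng(1)
n, dt, theta = 200_000, 1.0, 0.5

x_k = rng.normal(0.0, 1.0, size=n)             # state at step k
dW = rng.normal(0.0, np.sqrt(dt), size=n)      # increment simulated under Q (no drift)
x_next = x_k + dW
y = x_next**2                                  # stand-in for the next-step value

# One-step likelihood ratio dP_theta/dQ for a constant drift theta (toy assumption).
w = np.exp(theta * dW - 0.5 * theta**2 * dt)

phi = np.column_stack([np.ones(n), x_k, x_k**2])
beta = weighted_regression(phi, y, w)

# Under P_theta: E[y | x_k = x] = (x + theta*dt)**2 + dt, a quadratic in x.
beta_exact = np.array([dt + (theta * dt) ** 2, 2 * theta * dt, 1.0])
print(beta, beta_exact)
```

The weighted objective has the target-law conditional expectation as its minimiser, which is why the fitted coefficients should land near `beta_exact` even though no path was simulated with drift.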

If this is right

  • Non-asymptotic error bounds hold for deep neural network approximation of the embedded dynamic programming equation when parameters are fixed.
  • Quantitative estimates separate Monte Carlo sampling error from model-risk error when the control law is adapted to changing parameters.
  • Recalibration to new parameter values requires only reweighting of the existing training sample.
  • The method applies directly to path-dependent stochastic differential equations, rough-volatility hedging, and systems driven by fractional Brownian motion.
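The recalibration point can be illustrated with a fixed sample that is reused for several parameter values. The sketch assumes a toy drift parameter theta on a Brownian grid and a running maximum as the path-dependent functional; it uses a self-normalised estimator, which is a common variant and not necessarily the one analysed in the paper. The helper `reweighted_mean` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_paths, n_steps, dt = 50_000, 50, 0.02        # horizon T = 1

# The training sample is generated once, under the driftless reference law.
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
paths = np.cumsum(dW, axis=1)
running_max = np.maximum(paths.max(axis=1), 0.0)   # path-dependent functional

def reweighted_mean(values, dW, theta, dt):
    """Self-normalised importance-sampling mean under drift theta (toy assumption)."""
    log_w = theta * dW.sum(axis=1) - 0.5 * theta**2 * dt * dW.shape[1]
    w = np.exp(log_w)
    return np.sum(w * values) / np.sum(w)

# Recalibrating to a new theta only recomputes weights; no path is regenerated.
estimates = {th: reweighted_mean(running_max, dW, th, dt) for th in (-1.0, 0.0, 1.0)}
for th, est in estimates.items():
    print(th, est)
```

At theta = 0 the weights are identically one and the estimate reduces to the plain sample mean, which gives a quick sanity check on the weight formula.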

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same off-model structure could reduce simulation cost in any path-dependent control setting where a dominating law with moderate variance can be exhibited.
  • Error separation offers a practical way to decide how much computational budget to allocate to additional sampling versus parameter estimation.
  • If dominating laws exist for wider classes of non-Markovian drivers, the architecture would extend beyond the linear-quadratic examples shown.

Load-bearing premise

Explicit dominating training laws and associated Radon-Nikodym weights with controlled variance can be constructed for the representative classes of fully non-Markovian controlled systems.

What would settle it

A concrete non-Markovian controlled system from one of the paper's representative classes in which every candidate dominating law produces importance-sampling weights whose variance grows without bound as the time horizon or discretization level increases.
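To make the kind of blow-up at stake concrete, consider a standard textbook case that is not one of the paper's examples: the reference law $Q$ is Wiener measure on $[0,T]$ and the target law $P_\theta$ adds a constant drift $\theta$. The symbols $w_T$, $\theta$ and $W$ are introduced here only for illustration.

```latex
\begin{align*}
w_T &= \frac{dP_\theta}{dQ} = \exp\!\Big(\theta W_T - \tfrac{1}{2}\theta^2 T\Big), \\
\mathbb{E}_Q\big[w_T^2\big]
  &= \mathbb{E}_Q\big[\exp\!\big(2\theta W_T - \theta^2 T\big)\big]
   = \exp\!\big(2\theta^2 T - \theta^2 T\big)
   = e^{\theta^2 T}, \\
\operatorname{Var}_Q(w_T) &= e^{\theta^2 T} - 1 .
\end{align*}
```

Even in this benign Markovian case the weight variance grows exponentially in the horizon $T$ for any fixed $\theta \neq 0$, so the question for the paper's classes is whether the constructed dominating laws keep the corresponding quantity under control at the horizons and discretization levels of interest.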

Figures

Figures reproduced from arXiv: 2604.13147 by Adolfo M. D. da Silva, Alberto Ohashi, Dorival Leão, Simone Scotti.

Figure 1: Sample path of fractional Ornstein-Uhlenbeck on the interval [0, 1/12] with parameters of …

Figure 2: Empirical VarP&L as a function of the discretization level (Rough SV model with E[V_T] = 0.2, r_train = 0.5).
… not explore a sufficiently rich range of controls and the resulting hedge appears too rigid. If r_train is too large, the exploratory distribution becomes too diffuse and the regressions entering the dynamic programming step become noisier. The value r_train = 0.5 seems to provide the best compromise b…

Figure 3: Histogram of P&L. Rough volatility. ATM Put with S_0 = 100 = K. Number of Monte Carlo samples = 8000. r_train = 0.5.
5.3. A structured random-skeleton importance-sampling experiment under model risk. We now present a richer numerical illustration of the adaptive importance-sampling update of Section 4.2. The purpose of this experiment is to illustrate the adaptive importance sampling mechanism in a very simp…
Original abstract

This paper studies continuous-time stochastic control problems whose controlled states are fully non-Markovian and depend on unknown model parameters. Such problems arise naturally in path-dependent stochastic differential equations, rough-volatility hedging, and systems driven by fractional Brownian motion. Building on the discrete skeleton approach developed in earlier work, we propose a Monte Carlo learning methodology for the associated embedded backward dynamic programming equation. Our main contribution is twofold. First, we construct explicit dominating training laws and Radon-Nikodym weights for several representative classes of non-Markovian controlled systems. This yields an off-model training architecture in which a fixed synthetic dataset is generated under a reference law, while the dynamic programming operators associated with a target model are recovered by importance sampling. Second, we use this structure to design an adaptive update mechanism under parametric model uncertainty, so that repeated recalibration can be performed by reweighting the same training sample rather than regenerating new trajectories. For fixed parameters, we establish non-asymptotic error bounds for the approximation of the embedded dynamic programming equation via deep neural networks. For adaptive learning, we derive quantitative estimates that separate Monte Carlo approximation error from model-risk error. Numerical experiments illustrate both the off-model training mechanism and the adaptive importance-sampling update in structured linear-quadratic examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper develops an off-model Monte Carlo learning method for continuous-time stochastic control problems with fully non-Markovian controlled states and unknown parameters. It constructs explicit dominating training laws and square-integrable Radon-Nikodym weights for representative classes (path-dependent SDEs, rough-volatility models, fBM-driven systems) via discrete skeleton embedding. This permits generation of a single fixed synthetic dataset under a reference measure, with target-model dynamic programming operators recovered by importance sampling. Non-asymptotic error bounds are derived for deep neural network approximation of the embedded backward equation under fixed parameters, together with quantitative estimates that separate Monte Carlo sampling error from model-risk error under adaptive parametric updates. The approach is illustrated on linear-quadratic examples.

Significance. If the explicit constructions and variance bounds hold, the work supplies a practical mechanism for adaptive recalibration without trajectory regeneration, while rigorously separating approximation and model-risk contributions. The provision of concrete dominating measures and Radon-Nikodym weights for several non-Markovian classes, together with the non-asymptotic DNN bounds and error-separation estimates, constitutes a concrete advance over generic importance-sampling arguments. These features support reproducible numerical implementation and falsifiable error predictions in structured settings.

minor comments (3)
  1. [Abstract] The statement of non-asymptotic bounds does not indicate the dependence on network width, depth, or time horizon; adding a brief qualitative indication would improve readability without altering the technical content.
  2. [§1, Introduction] The reference to the 'discrete skeleton approach developed in earlier work' would benefit from an explicit citation or pointer to the relevant prior result, so that readers can locate the embedding construction.
  3. [Numerical experiments] While the LQ examples illustrate the mechanism, quantitative reporting of the realized variance of the importance weights (or a comparison to the derived bounds) would strengthen the empirical support for the controlled-variance claim.
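The diagnostic requested in the third comment could look roughly like the following. This is a generic sketch, not code from the paper: it reports the Kish effective sample size and the squared coefficient of variation of a set of importance weights, and checks them on a toy Gaussian drift shift for which the theoretical value exp(theta^2 T) - 1 is known. The function name `weight_diagnostics` and the toy parameters are hypothetical.

```python
import numpy as np

def weight_diagnostics(log_w):
    """Return (effective sample size, squared coefficient of variation) of the weights."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())        # subtract the max for numerical stability
    w_norm = w / w.sum()
    ess = 1.0 / np.sum(w_norm**2)          # Kish effective sample size
    cv2 = len(w) / ess - 1.0               # empirical Var(w) / mean(w)**2
    return ess, cv2

# Toy check: reference N(0, T), target N(theta*T, T), so log w = theta*W_T - 0.5*theta^2*T.
rng = np.random.default_rng(3)
theta, T, n = 0.5, 1.0, 200_000
W_T = rng.normal(0.0, np.sqrt(T), size=n)
log_w = theta * W_T - 0.5 * theta**2 * T

ess, cv2 = weight_diagnostics(log_w)
cv2_theory = np.exp(theta**2 * T) - 1.0    # known value for this toy case only
print(ess, cv2, cv2_theory)
```

Reporting `cv2` next to a derived bound, for each discretization level and parameter value, is one plausible way to present the comparison the referee asks for.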

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were listed in the report, so we have no individual points to address. We remain available to incorporate any editorial suggestions or minor clarifications in a revised version.

Circularity Check

0 steps flagged

Minor self-citation to prior discrete skeleton method; central claims remain independent

full rationale

The derivation relies on explicit constructions of dominating training laws and Radon-Nikodym weights for path-dependent SDEs, rough-volatility, and fBM-driven systems, which are supplied in the paper via the discrete skeleton embedding rather than assumed or fitted. Non-asymptotic DNN approximation bounds and the separation of Monte Carlo versus model-risk errors in the adaptive setting follow directly from these weights and standard concentration arguments, without reducing to self-referential definitions or re-using fitted quantities as predictions. The single reference to earlier discrete skeleton work is not load-bearing for the new off-model architecture or error estimates, which are self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that dominating reference laws with usable Radon-Nikodym derivatives exist and can be written explicitly for the non-Markovian classes studied; no free parameters or new entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Existence of explicit dominating training laws and Radon-Nikodym weights for representative classes of fully non-Markovian controlled systems.
    Required for the off-model training architecture to function without degeneracy.

pith-pipeline@v0.9.0 · 5546 in / 1452 out tokens · 71895 ms · 2026-05-10T14:17:24.385700+00:00 · methodology

