Malliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement Learning

Luke Snow; Vikram Krishnamurthy

arxiv: 2604.01345 · v2 · submitted 2026-04-01 · 💻 cs.LG


Pith reviewed 2026-05-13 22:05 UTC · model grok-4.3

classification 💻 cs.LG

keywords: inverse reinforcement learning, adaptive IRL, Malliavin calculus, counterfactual gradients, Langevin dynamics, Skorohod integral, gradient estimation, passive learning


The pith

Malliavin calculus reformulates counterfactual gradients as ratios of expectations to achieve efficient estimation in adaptive inverse reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a passive algorithm for adaptive inverse reinforcement learning that reconstructs a forward learner's loss function by observing its gradients during reinforcement learning. The central difficulty is that the needed gradients are counterfactual, conditioned on events of probability zero along the observed trajectory, which makes naive sampling useless and kernel smoothing slow. By applying Malliavin calculus to a general Langevin structure, the authors rewrite the conditional expectations as ratios of ordinary unconditioned expectations that involve explicit Malliavin derivatives and their Skorohod integral adjoints. This change restores standard Monte Carlo convergence rates and yields a concrete estimation procedure. A reader cares because the approach lets one recover reward functions from black-box learners without ever intervening in their training.

Core claim

For forward learners that obey a general Langevin diffusion, the required counterfactual gradient equals the ratio of two unconditioned expectations, each built from Malliavin derivatives of the state process and the adjoint Skorohod integral of the test function. Direct Monte Carlo sampling of these quantities therefore produces consistent estimators whose error decays at the standard Monte Carlo rate, O(1/sqrt(N)) for N samples.
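A minimal sketch of the ratio form in the simplest case: a standard Brownian motion W (Langevin dynamics with a flat loss). This is an illustrative assumption, not the paper's general Langevin setting. Conditioning on W_T = x is a probability-zero event, so naive rejection sampling never accepts a sample. Malliavin integration by parts with the weight u_t = 1/(T - s) on (s, T] replaces the Dirac delta by a Heaviside indicator times the Skorohod integral (W_T - W_s)/(T - s). This gives E[W_s | W_T = x] = E[W_s 1{W_T > x}(W_T - W_s)] / E[1{W_T > x}(W_T - W_s)], in the spirit of Bouchard, Ekeland and Touzi [9]. The function name and parameters below are hypothetical, and the exact Brownian-bridge value (s/T)x serves as a check.

```python
import math
import random


def malliavin_bridge_mean(x, s=0.5, T=1.0, n=200_000, seed=0):
    """Estimate E[W_s | W_T = x] for a standard Brownian motion W.

    The probability-zero conditioning is replaced by a ratio of two
    unconditioned expectations: the Dirac delta at W_T = x becomes the
    Heaviside indicator 1{W_T > x} times the Malliavin weight
    (W_T - W_s) / (T - s). The 1 / (T - s) factor cancels in the ratio.
    """
    rng = random.Random(seed)
    sd_s, sd_incr = math.sqrt(s), math.sqrt(T - s)
    num = 0.0
    den = 0.0
    for _ in range(n):
        w_s = rng.gauss(0.0, sd_s)          # W_s ~ N(0, s)
        incr = rng.gauss(0.0, sd_incr)      # W_T - W_s ~ N(0, T - s), independent of W_s
        if w_s + incr > x:                  # Heaviside H(W_T - x)
            num += w_s * incr               # test function W_s times the weight
            den += incr                     # weight alone: (T - s) times the density of W_T at x, on average
    return num / den


x, s, T = 0.5, 0.5, 1.0
estimate = malliavin_bridge_mean(x, s, T)
exact = s / T * x                           # Brownian-bridge mean
print(f"Malliavin ratio estimate {estimate:.4f}, exact value {exact:.4f}")
```

Both expectations are unconditioned, so every sample contributes. The ratio is consistent, but like any ratio of sample means it carries a bias of order 1/N.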

What carries the argument

Malliavin calculus reformulation that converts counterfactual conditioning into a ratio of unconditioned expectations via Malliavin derivatives and Skorohod integrals for Langevin diffusions.

If this is right

Adaptive IRL can now run with passive observations and standard Monte Carlo rates instead of kernel smoothing.
The same derivative formulas apply to any forward learner whose dynamics match the assumed Langevin form.
Explicit algorithmic recipes are given for evaluating the Malliavin quantities in practice.
The resulting gradient estimates can be plugged directly into existing inverse reinforcement learning updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The technique could extend to other counterfactual estimation tasks in stochastic optimization whenever similar derivative operators are available.
In high-dimensional continuous control problems, the method may scale better than kernel approaches because it avoids nonparametric bandwidth selection.
A direct test would compare empirical convergence curves on simulated Langevin agents against the predicted 1/sqrt(N) rate.
Discrete-time or jump-process learners would require analogous stochastic calculus operators to obtain the same ratio form.

Load-bearing premise

The forward learner must obey a Langevin diffusion structure for which the needed Malliavin derivatives and Skorohod integrals exist and can be written down explicitly.

What would settle it

Simulate a known Langevin process, compute the proposed estimator for increasing sample sizes N, and check whether the observed error decays proportionally to 1 over square root of N; slower decay would show the reformulation fails to remove the conditioning.
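The proposed test can be sketched on a toy problem with a known answer: estimating E[W_s | W_T = x] for a standard Brownian motion W, whose exact value is (s/T)x. This toy process is an assumption chosen for checkability; the paper's forward learner is a general Langevin diffusion. The sketch uses the Malliavin ratio estimator E[W_s 1{W_T > x}(W_T - W_s)] / E[1{W_T > x}(W_T - W_s)], repeats it at several sample sizes N, and fits a log-log slope to the root-mean-square error. A slope near -0.5 is what an O(1/sqrt(N)) rate predicts. Helper names, sample sizes and the repetition count are arbitrary choices.

```python
import math
import random


def ratio_estimate(rng, x, s, T, n):
    """Malliavin ratio estimator of E[W_s | W_T = x] from n fresh samples."""
    sd_s, sd_incr = math.sqrt(s), math.sqrt(T - s)
    num = den = 0.0
    for _ in range(n):
        w_s = rng.gauss(0.0, sd_s)
        incr = rng.gauss(0.0, sd_incr)
        if w_s + incr > x:                  # Heaviside H(W_T - x)
            num += w_s * incr
            den += incr
    return num / den


def rmse_curve(sizes, reps=50, x=0.5, s=0.5, T=1.0, seed=1):
    """Root-mean-square error over `reps` independent runs for each sample size."""
    rng = random.Random(seed)
    exact = s / T * x
    curve = []
    for n in sizes:
        sq_errors = [(ratio_estimate(rng, x, s, T, n) - exact) ** 2 for _ in range(reps)]
        curve.append(math.sqrt(sum(sq_errors) / reps))
    return curve


def loglog_slope(xs, ys):
    """Least-squares slope of log(ys) against log(xs)."""
    lx = [math.log(v) for v in xs]
    ly = [math.log(v) for v in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var


sizes = [1_000, 4_000, 16_000]
errors = rmse_curve(sizes)
slope = loglog_slope(sizes, errors)
print("RMSE by N:", dict(zip(sizes, (round(e, 4) for e in errors))))
print(f"fitted log-log slope {slope:.2f}; an O(1/sqrt(N)) rate predicts -0.5")
```

Three sample sizes spanning a factor of 16 keep the run short in pure Python; more sizes or repetitions would tighten the slope estimate.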

Figures

Figures reproduced from arXiv: 2604.01345 by Luke Snow, Vikram Krishnamurthy.

**Figure 4.** Adaptive IRL for reconstructing the loss function.

Original abstract

Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses. Adaptive IRL aims to reconstruct the loss function of a forward learner by passively observing its gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the required gradients in the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner's trajectory. Therefore, naive Monte Carlo estimators are prohibitively inefficient, and kernel smoothing, though common, suffers from slow convergence. We overcome this by employing Malliavin calculus to efficiently estimate the required counterfactual gradients. We reformulate the counterfactual conditioning as a ratio of unconditioned expectations involving Malliavin quantities, thus recovering standard estimation rates. We derive the necessary Malliavin derivatives and their adjoint Skorohod integral formulations for a general Langevin structure, and provide a concrete algorithmic approach which exploits these for counterfactual gradient estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main move is rewriting counterfactual gradients in passive adaptive IRL as a ratio of unconditional expectations using Malliavin derivatives and Skorohod integrals for Langevin dynamics.


The core idea here is to handle the zero-probability conditioning problem in adaptive IRL by turning the required counterfactual gradients into a ratio of ordinary expectations that involve Malliavin derivatives and their adjoints. This is meant to restore standard Monte Carlo convergence rates instead of relying on kernel smoothing. The authors derive the necessary Malliavin quantities explicitly for a general Langevin forward process and sketch an algorithm that uses them. That reformulation is the genuinely new piece, and it combines stochastic analysis and IRL in a way that prior work on adaptive IRL has not. If the derivations hold up, it gives a cleaner path for passive observation of gradients during RL training. The technical step of working out the Skorohod integral formulations for the general case is non-trivial and worth the effort the authors put in.

The soft spot is exactly the one flagged in the stress test. For arbitrary nonlinear, state-dependent drift and diffusion coefficients, the first-variation process and the inverse of the Malliavin covariance matrix rarely admit simple closed forms without extra conditions such as global Lipschitz continuity or uniform ellipticity. The abstract claims the derivation works for a general Langevin structure, but it neither lists those restrictions nor shows the explicit expressions. If those conditions turn out to be necessary, the claimed generality and rate recovery shrink. The abstract also reports no empirical checks or error bounds, so the practical payoff is still untested.

This is for people working on inverse RL, counterfactual estimation, or continuous-time stochastic control who already know Malliavin calculus. It shows honest engagement with the relevant math literature rather than hand-waving. I would send it to referees so the derivations and any hidden assumptions can be examined in detail.

Referee Report

2 major / 1 minor

Summary. The paper proposes a passive algorithm for adaptive inverse reinforcement learning that observes gradients from a forward learner following Langevin dynamics. It uses Malliavin calculus to reformulate counterfactual gradients (conditioned on zero-probability events) as a ratio of unconditioned expectations involving Malliavin derivatives and Skorohod integrals, claiming this recovers standard Monte Carlo estimation rates for a general Langevin structure.

Significance. If the explicit derivations hold with verifiable error bounds, the work would offer a principled alternative to kernel smoothing for counterfactual estimation in passive IRL settings, potentially enabling faster convergence without the curse of dimensionality that kernel methods face. The approach leverages advanced stochastic analysis tools in a novel way for RL, which could influence future work on efficient gradient estimation from observed trajectories.

major comments (2)

[Abstract] The central claim that the counterfactual gradient is recovered as a ratio E[·]/E[·] involving Malliavin quantities for a 'general Langevin structure' is load-bearing, yet no assumptions on the drift b(X) and diffusion σ(X) (e.g., global Lipschitz, uniform ellipticity, or polynomial growth) are stated. Without these, the existence of explicit closed-form Malliavin derivatives and invertible covariance for the Skorohod integral is unclear for arbitrary nonlinear state-dependent coefficients, as highlighted by the stress-test concern.
[Abstract] The assertion that the reformulation recovers 'standard estimation rates' lacks any supporting error analysis, variance bounds, or convergence statement in the provided description. Since the manuscript's novelty rests on transferring Monte Carlo rates to the adaptive-IRL counterfactual setting, a detailed derivation (presumably in the methods section) with explicit rate statements is required to substantiate the claim.

minor comments (1)

Define the Malliavin derivative operator D and Skorohod integral notation explicitly on first use, with reference to a standard text such as Nualart (2006) for readers unfamiliar with the framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each major point below and will revise the manuscript to improve clarity on assumptions and convergence analysis.


Referee: [Abstract] The central claim that the counterfactual gradient is recovered as a ratio E[·]/E[·] involving Malliavin quantities for a 'general Langevin structure' is load-bearing, yet no assumptions on the drift b(X) and diffusion σ(X) (e.g., global Lipschitz, uniform ellipticity, or polynomial growth) are stated. Without these, the existence of explicit closed-form Malliavin derivatives and invertible covariance for the Skorohod integral is unclear for arbitrary nonlinear state-dependent coefficients, as highlighted by the stress-test concern.

Authors: We agree that the assumptions are essential for the Malliavin calculus framework. The full manuscript relies on standard conditions (global Lipschitz continuity of b and σ, uniform ellipticity of the diffusion, and polynomial growth) to ensure Malliavin derivatives exist in closed form and the Malliavin covariance process is invertible a.s. We will explicitly list these assumptions in the abstract and introduction, with a short justification referencing classical results on Malliavin calculus for SDEs. revision: yes
Referee: [Abstract] The assertion that the reformulation recovers 'standard estimation rates' lacks any supporting error analysis, variance bounds, or convergence statement in the provided description. Since the manuscript's novelty rests on transferring Monte Carlo rates to the adaptive-IRL counterfactual setting, a detailed derivation (presumably in the methods section) with explicit rate statements is required to substantiate the claim.

Authors: The methods section derives the ratio-of-expectations estimator. Its numerator and denominator are unbiased sample means, so the ratio is consistent and, by the delta method, its error decays at the standard Monte Carlo rate O(1/sqrt(N)) for N independent samples, with explicit variance bounds derived from the Skorohod integral representation. We will revise the abstract to include a concise statement of these rates and add a short error-analysis paragraph summarizing the variance bounds already present in the methods. revision: partial

Circularity Check

0 steps flagged

No circularity: reformulation applies standard Malliavin calculus to Langevin SDE


The paper's central step reformulates counterfactual gradients as a ratio of unconditional expectations via Malliavin derivatives and Skorohod integrals for a general Langevin structure. This is a direct application of established Malliavin calculus identities to the forward SDE, without any reduction of the claimed estimator to fitted parameters, self-defined quantities, or load-bearing self-citations. The derivation remains self-contained because the required Malliavin objects are invoked from external theory under stated existence assumptions, and no equation equates the output estimator to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence of Malliavin derivatives for the Langevin process and the validity of the Skorohod integral adjoint; these are standard in the domain but not independently verified here.

axioms (1)

domain assumption: forward learner dynamics follow a general Langevin structure.
Invoked to derive the Malliavin quantities and Skorohod integrals



Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1] A. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” in International Conference on Machine Learning, 2000, pp. 663–670.

[2] V. Krishnamurthy and G. Yin, “Langevin dynamics for adaptive inverse reinforcement learning of stochastic gradient algorithms,” Journal of Machine Learning Research, vol. 22, pp. 1–49, 2021.

[3] V. Krishnamurthy and G. Yin, “Multikernel passive stochastic gradient algorithms and transfer learning,” IEEE Transactions on Automatic Control, vol. 67, no. 4, pp. 1792–1805, 2022.

[4] L. Snow and V. Krishnamurthy, “Finite-sample bounds for adaptive inverse reinforcement learning using passive Langevin dynamics,” IEEE Transactions on Information Theory, vol. 71, no. 6, pp. 4637–4670, 2025.

[5] G. Yin and K. Yin, “Passive stochastic approximation with constant step size and window width,” IEEE Transactions on Automatic Control, vol. 41, no. 1, pp. 90–106, 1996.

[6] H. J. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. Springer-Verlag, 2003.

[7] A. V. Nazin, B. T. Polyak, and A. B. Tsybakov, “Passive stochastic approximation,” Automation and Remote Control, no. 50, pp. 1563–1569, 1989.

[8] V. Krishnamurthy and L. Snow, “Malliavin calculus with weak derivatives for counterfactual stochastic optimization,” arXiv preprint arXiv:2510.00297, 2025.

[9] B. Bouchard, I. Ekeland, and N. Touzi, “On the Malliavin approach to Monte Carlo approximation of conditional expectations,” Finance and Stochastics, vol. 8, no. 1, pp. 45–71, 2004.

[10] S. N. Afriat, “Efficiency estimation of production functions,” International Economic Review, pp. 568–598, 1972.

[11] W. Diewert, “Afriat’s theorem and some extensions to choice under uncertainty,” The Economic Journal, vol. 122, no. 560, pp. 305–331, 2012.

[12] M. Raginsky, A. Rakhlin, and M. Telgarsky, “Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis,” in Conference on Learning Theory, 2017, pp. 1674–1703.

[13] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient Langevin dynamics,” in International Conference on Machine Learning, 2011, pp. 681–688.

[14] D. Nualart, The Malliavin Calculus and Related Topics. Springer, 2006.

[15] E. Fournié, J.-M. Lasry, J. Lebuchoux, and P.-L. Lions, “Applications of Malliavin calculus to Monte-Carlo methods in finance. II,” Finance and Stochastics, vol. 5, no. 2, pp. 201–236, 2001.

[16] D. Crisan, K. Manolarakis, and N. Touzi, “On the Monte Carlo simulation of BSDEs: An improvement on the Malliavin weights,” Stochastic Processes and their Applications, vol. 120, no. 7, pp. 1133–1158, 2010.

[17] E. Gobet and R. Munos, “Sensitivity analysis using Ito–Malliavin calculus and martingales, and application to stochastic optimal control,” SIAM Journal on Control and Optimization, vol. 43, no. 5, pp. 1676–1713, 2005.
[18] A. Kohatsu-Higa and K. Yasuda, “Estimating multidimensional density functions using the Malliavin–Thalmaier formula,” SIAM Journal on Numerical Analysis, vol. 47, no. 2, pp. 1546–1575, 2009.

VII. Appendix

A. Supporting Lemmas

Lemma 6: $L'''(X_t)\,dt = dL'(X_t) - L''(X_t)\,dX_t$.

B. Proofs
Proof of Lemma 2: Let $Y_s := \nabla_x X_s$. Differentiating (18) with respect to the initial condition gives $Y_s = 1 - \int_0^s \nabla^2 L(X_u)\,Y_u\,du$, or equivalently $\frac{d}{ds} Y_s = -\nabla^2 L(X_s)\,Y_s$, $Y_0 = 1$. Hence $Y_s = \exp\left(-\int_0^s \nabla^2 L(X_u)\,du\right)$.
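Lemma 2's closed form for the first-variation process can be checked numerically. The sketch assumes a hypothetical one-dimensional double-well loss L(x) = x^4/4 - x^2/2 and the forward dynamics dX_t = -L'(X_t)dt + sqrt(2)dW_t implied by the proof of Lemma 6, discretized by Euler–Maruyama with step h. On a single noise path it compares three quantities: a finite-difference derivative of X_s in the initial condition, the exact derivative of the Euler scheme (a product of factors 1 - hL''(X_k)), and the exponential formula of Lemma 2. The first two should agree to rounding error, and the third should agree to O(h).

```python
import math
import random


def grad_L(x):
    """L'(x) for the hypothetical double-well loss L(x) = x**4 / 4 - x**2 / 2."""
    return x**3 - x


def hess_L(x):
    """L''(x) for the same loss."""
    return 3 * x**2 - 1


def euler_path(x0, noise, h):
    """Euler-Maruyama path of dX = -L'(X) dt + sqrt(2) dW driven by fixed N(0, 1) draws."""
    xs = [x0]
    for xi in noise:
        x = xs[-1]
        xs.append(x - grad_L(x) * h + math.sqrt(2 * h) * xi)
    return xs


h, n_steps = 1e-3, 1_000                    # horizon s = n_steps * h = 1
rng = random.Random(3)
noise = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
x0, eps = 0.5, 1e-6

path = euler_path(x0, noise, h)
# Pathwise derivative dX_s / dx0: central finite difference reusing the same noise.
fd = (euler_path(x0 + eps, noise, h)[-1] - euler_path(x0 - eps, noise, h)[-1]) / (2 * eps)
# Exact derivative of the Euler scheme: Y_{k+1} = (1 - h L''(X_k)) Y_k, Y_0 = 1.
y_discrete = math.prod(1 - h * hess_L(xk) for xk in path[:-1])
# Lemma 2: Y_s = exp(-int_0^s L''(X_u) du), with the integral as a left Riemann sum.
y_lemma = math.exp(-h * sum(hess_L(xk) for xk in path[:-1]))
print(f"finite difference {fd:.6f}, discrete Y {y_discrete:.6f}, Lemma 2 {y_lemma:.6f}")
```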
Proof of Lemma 6: Itô's lemma gives $dL'(X_t) = \left(L'''(X_t) - L'(X_t)L''(X_t)\right)dt + \sqrt{2}\,L''(X_t)\,dW_t$, and by the definition of the forward Langevin dynamics we may write $dW_t = \left(dX_t + L'(X_t)\,dt\right)/\sqrt{2}$. Substituting and rearranging terms gives $L'''(X_t)\,dt = dL'(X_t) - L''(X_t)\,dX_t$.
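Lemma 6 is a pathwise Itô identity, so it can be checked on one simulated trajectory. The sketch assumes a hypothetical loss L(x) = x^4/4 - x^2/2 and Euler–Maruyama steps for dX_t = -L'(X_t)dt + sqrt(2)dW_t. It accumulates a Riemann sum for the left-hand side and left-endpoint sums for dL'(X_t) - L''(X_t)dX_t; the two should agree up to an error that shrinks as the step size h decreases.

```python
import math
import random


def d1(x):
    """L'(x) for the hypothetical loss L(x) = x**4 / 4 - x**2 / 2."""
    return x**3 - x


def d2(x):
    """L''(x)."""
    return 3 * x**2 - 1


def d3(x):
    """L'''(x)."""
    return 6 * x


h, n_steps = 1e-5, 100_000                  # horizon T = 1
rng = random.Random(7)
x = 0.5
lhs = 0.0   # Riemann sum for int_0^T L'''(X_t) dt
rhs = 0.0   # int_0^T dL'(X_t) - int_0^T L''(X_t) dX_t, with L'' at the left endpoint (Ito)
for _ in range(n_steps):
    x_next = x - d1(x) * h + math.sqrt(2 * h) * rng.gauss(0.0, 1.0)   # Euler-Maruyama step
    lhs += d3(x) * h
    rhs += (d1(x_next) - d1(x)) - d2(x) * (x_next - x)
    x = x_next
print(f"int L''' dt = {lhs:.3f}; int dL' - int L'' dX = {rhs:.3f}")
```

Evaluating L'' at the left endpoint is what makes the sum an Itô integral. A midpoint (Stratonovich) sum would send the right-hand side to zero, because the identity comes entirely from the quadratic variation d<X>_t = 2dt.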
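Proof of Lemma 4: By Lemma 6, we may write out $D_t\Gamma$ as
$$D_t\Gamma = \Gamma \int_0^s D_t X_u \left[dL'(X_u) - L''(X_u)\,dX_u\right] = \sqrt{2}\,\Gamma \int_0^s \exp\left(-\int_t^u L''(X_\gamma)\,d\gamma\right) \left[dL'(X_u) - L''(X_u)\,dX_u\right].$$
Thus,
$$\begin{aligned} S(u) &= \Gamma S(\tilde u) - \int_0^T D_t\Gamma \cdot \tilde u_t\,dt \\ &= \exp\left(\int_0^s L''(X_u)\,du\right) \int_0^T \frac{1}{\sqrt{2}\,T} \exp\left(-\int_0^t L''(X_u)\,du\right) dW_t \\ &\quad - \int_0^T \Gamma \int_0^s \exp\left(-\int_t^u L''(X_\gamma)\,d\gamma\right) \left[dL'(X_u) - L''(X_u)\,dX_u\right] \cdot \frac{1}{T} \exp\left(-\int_0^t L''(X_u)\,du\right) dt. \end{aligned} \tag{31}$$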
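The first equality in (31) is the product rule for the Skorohod integral, S(Fu) = F S(u) - ∫_0^T D_tF u_t dt. It is needed because Γ, which the right-hand side of (31) identifies as exp(∫_0^s L''(X_u)du), depends on the path up to time s, so Γũ is not adapted. Below is a Monte Carlo sanity check of the rule on a hypothetical toy case unrelated to the paper's Γ: F = W_T^2 and u_t = 1 on [0, T], for which S(u) = W_T, D_tF = 2W_T and S(Fu) = W_T^3 - 2TW_T. The result should have mean zero and satisfy the duality E[S(Fu) G] = E[∫_0^T D_tG F u_t dt], which equals T^2 for G = W_T.

```python
import math
import random


def skorohod_product_rule_check(T=1.0, n=200_000, seed=11):
    """Monte Carlo check of S(F u) = F S(u) - int_0^T D_t F u_t dt.

    Toy case: F = W_T**2 and u_t = 1 on [0, T], so S(u) = W_T, D_t F = 2 W_T,
    and the product rule gives S(F u) = W_T**3 - 2 T W_T.
    """
    rng = random.Random(seed)
    sd = math.sqrt(T)
    sum_s = 0.0        # accumulates S(F u), whose mean should be 0
    sum_dual = 0.0     # accumulates S(F u) * G with G = W_T
    for _ in range(n):
        w = rng.gauss(0.0, sd)
        s_fu = w**3 - 2.0 * T * w
        sum_s += s_fu
        sum_dual += s_fu * w
    # Duality: E[S(F u) G] = E[int_0^T D_t G * F u_t dt] = E[T * W_T**2] = T**2.
    return sum_s / n, sum_dual / n, T**2


mean_s, dual, target = skorohod_product_rule_check()
print(f"E[S(Fu)] ~ {mean_s:.4f} (expect 0); E[S(Fu) W_T] ~ {dual:.4f} (expect {target})")
```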
Proof of Theorem 5: The main idea is as follows. By the law of large numbers, the empirical numerator and denominator in Algorithm 1 converge almost surely to their population counterparts, so the Malliavin estimator is consistent. Substituting this estimator into the outer update yields an Euler–Maruyama scheme, $\alpha_{k+1} = \alpha_k - \eta\,\widehat{\nabla L}_N(\alpha_k) + \sqrt{2\beta^{-1}\eta}\,\xi_{k+1}$, $\xi_{k+1} \sim \mathcal{N}$…
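The outer update in the proof of Theorem 5 can be sketched with a hypothetical stand-in for the Malliavin gradient estimate: the exact gradient of the quadratic loss L(α) = (α - 1)^2/2 plus zero-mean noise. For an exact gradient, the continuous-time limit dα = -∇L(α)dt + sqrt(2/β)dW has stationary density proportional to exp(-βL(α)), so long-run samples should have mean near 1 and variance near 1/β. The step size η, inverse temperature β, noise level and iteration count are illustrative values, not the paper's.

```python
import math
import random


def noisy_quadratic_grad(alpha, rng, target=1.0, noise_sd=0.5):
    """Hypothetical stand-in for the Malliavin gradient estimate of
    L(alpha) = (alpha - target)**2 / 2: exact gradient plus zero-mean noise."""
    return (alpha - target) + noise_sd * rng.gauss(0.0, 1.0)


def passive_langevin(grad_estimate, alpha0=0.0, eta=0.01, beta=4.0, n_iter=200_000, seed=5):
    """Euler-Maruyama update alpha <- alpha - eta * grad + sqrt(2 eta / beta) * xi."""
    rng = random.Random(seed)
    alpha = alpha0
    scale = math.sqrt(2.0 * eta / beta)
    samples = []
    for _ in range(n_iter):
        alpha = alpha - eta * grad_estimate(alpha, rng) + scale * rng.gauss(0.0, 1.0)
        samples.append(alpha)
    return samples


samples = passive_langevin(noisy_quadratic_grad)[20_000:]     # discard burn-in
mean = sum(samples) / len(samples)
var = sum((a - mean) ** 2 for a in samples) / len(samples)
print(f"sample mean {mean:.3f} (minimizer 1.0); sample variance {var:.3f} (1/beta = 0.25)")
```

With η = 0.01 and gradient-noise standard deviation 0.5, the discrete chain's stationary variance is (2η/β + 0.25η²)/(2η - η²) ≈ 0.2525 rather than exactly 1/β, so a small η keeps both discretization and estimator-noise effects minor.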
