Estimating Dynamic Marginal Policy Effects under Sequential Unconfoundedness

I-han Lai; Stefan Wager

arxiv: 2604.05639 · v2 · pith:SK5JTIP6new · submitted 2026-04-07 · 📊 stat.ME

Pith reviewed 2026-05-19 16:52 UTC · model grok-4.3

classification 📊 stat.ME

keywords dynamic marginal policy effects · sequential unconfoundedness · doubly robust estimation · off-policy evaluation · dynamic systems · causal inference · policy evaluation

The pith

Dynamic marginal policy effects can be identified via tractable reduced-form expressions and estimated with a doubly robust estimator under sequential unconfoundedness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops methods for estimating how small policy changes affect long-term outcomes in dynamic systems. It establishes that dynamic marginal policy effects can be identified through simple reduced-form expressions rather than full dynamic modeling. A doubly robust estimator is proposed that operates under sequential unconfoundedness. The approach requires only partial observations of the system history instead of complete state information at each step. It also sidesteps the exponential growth in complexity that typically arises with longer time horizons.

Core claim

The paper shows that dynamic marginal policy effects can be identified via tractable reduced-form expressions and estimated under sequential unconfoundedness with a doubly robust estimator. This estimator does not require observing full dynamic state information, as is typical for off-policy evaluation in Markov decision processes, and avoids the exponential curse of horizon that arises in non-Markovian settings. Practicality is illustrated through simulations, including one drawn from a dynamic pricing application where past prices shape a reference level for current decisions.
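To make the target quantity concrete, here is a minimal sketch of a reference-price simulator and a finite-difference approximation to a marginal policy effect. Everything in it is an illustrative assumption rather than the paper's design: the linear demand model, the exponentially smoothed reference price, the parameter values, and the function names `discounted_revenue` and `finite_difference_mpe` are all hypothetical. It also relies on oracle access to the simulator, which the paper's estimator is designed to avoid by working from observed data.

```python
import numpy as np

def discounted_revenue(shift, n_paths=20000, horizon=10, gamma=0.9,
                       base_price=1.0, alpha=0.3, a=4.0, b=1.0, c=0.5,
                       price_sd=0.2, seed=0):
    """Average discounted revenue when every price is shifted by `shift`.

    Hypothetical model (not taken from the paper):
      reference  r_{t+1} = (1 - alpha) * r_t + alpha * p_t
      price      p_t     = base_price + shift + u_t
      demand     d_t     = a - b * p_t - c * (p_t - r_t) + e_t
    """
    rng = np.random.default_rng(seed)  # fixed seed: common random numbers across shifts
    u = rng.normal(0.0, price_sd, size=(n_paths, horizon))
    e = rng.normal(0.0, 0.1, size=(n_paths, horizon))
    ref = np.full(n_paths, base_price)  # initial reference level
    total = np.zeros(n_paths)
    for t in range(horizon):
        price = base_price + shift + u[:, t]
        demand = a - b * price - c * (price - ref) + e[:, t]
        total += gamma**t * price * demand
        ref = (1 - alpha) * ref + alpha * price  # past prices move the reference
    return total.mean()

def finite_difference_mpe(eps=1e-3, **kwargs):
    """Central finite difference of discounted revenue in the price shift."""
    up = discounted_revenue(+eps, **kwargs)
    down = discounted_revenue(-eps, **kwargs)
    return (up - down) / (2 * eps)
```

Calling `finite_difference_mpe()` and `finite_difference_mpe(c=0.0)` compares the marginal effect with and without the reference-price channel. In this toy model the two differ because a price shift also moves future reference levels, which is the kind of carryover a purely static analysis would miss.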

What carries the argument

Reduced-form identification of dynamic marginal policy effects paired with a doubly robust estimator under sequential unconfoundedness.

If this is right

Long-term impacts of policy adjustments become estimable in dynamic settings with only partial state observations.
Estimation remains computationally feasible for long time horizons without exponential cost growth.
Policy evaluation gains robustness to misspecification through the double robustness property.
Applications such as dynamic pricing can incorporate reference-level effects from past decisions without full state data.
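The double robustness property can be illustrated with a static, single-period sketch. This is the textbook augmented inverse-propensity-weighted (AIPW) estimator of a mean potential outcome, swapped in for illustration; it is not the paper's dynamic MPE estimator. The data-generating process, the deliberately wrong nuisance models, and all variable names are assumptions made up for this example.

```python
import numpy as np

def aipw_mean_treated_outcome(y, w, mu1_hat, e_hat):
    """AIPW estimate of E[Y(1)]: outcome-model prediction plus an
    inverse-propensity-weighted residual correction."""
    return np.mean(mu1_hat + w * (y - mu1_hat) / e_hat)

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-x))         # true propensity P(W = 1 | X)
w = rng.binomial(1, e_true)
y1 = 1.0 + 2.0 * x + rng.normal(size=n)   # potential outcome under treatment
y = w * y1                                 # untreated rows are multiplied by w = 0 below
truth = 1.0                                # E[Y(1)] = 1 + 2 * E[X] = 1

mu1_true = 1.0 + 2.0 * x
mu1_wrong = np.zeros(n)                    # deliberately misspecified outcome model
e_wrong = np.full(n, 0.5)                  # deliberately misspecified propensity

est_wrong_outcome = aipw_mean_treated_outcome(y, w, mu1_wrong, e_true)
est_wrong_propensity = aipw_mean_treated_outcome(y, w, mu1_true, e_wrong)
est_both_wrong = aipw_mean_treated_outcome(y, w, mu1_wrong, e_wrong)
```

In this sketch the first two estimates should land near `truth` because one of the two nuisance models is correct in each case, while `est_both_wrong` is noticeably biased. That is the sense in which the robustness is "double" and not unconditional.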

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reduced-form approach may extend to causal estimation in longitudinal data with time-varying treatments observed only partially.
It could support online adjustment of policies by providing marginal effect estimates at each step.
Integration with flexible machine learning models for the nuisance functions might improve performance in high-dimensional histories.

Load-bearing premise

Sequential unconfoundedness holds so that treatment assignment at each time depends only on observed history without hidden confounding.

What would settle it

A simulation in which unobserved factors affect both the sequence of policies and the long-term outcomes, producing systematic bias in the doubly robust estimator.
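A stripped-down version of such a check might look like the sketch below. It is an assumption-laden toy: a linear model with a single hidden factor `u` that drives both the action sequence and the outcome, and ordinary least squares standing in for the estimator (the paper's doubly robust estimator is not implemented here). The helper `summed_action_slopes` and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, horizon, beta, hidden_strength = 100_000, 3, 1.0, 2.0

u = rng.normal(size=n)                                       # hidden confounder
actions = 0.8 * u[:, None] + rng.normal(size=(n, horizon))   # u shifts every action
outcome = beta * actions.sum(axis=1) + hidden_strength * u + rng.normal(size=n)

def summed_action_slopes(design_columns, outcome, n_actions):
    """Fit OLS with an intercept and return the sum of the action slopes,
    i.e. the estimated effect of shifting every action by one unit."""
    design = np.column_stack([np.ones(len(outcome))] + design_columns)
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef[1:1 + n_actions].sum()

true_effect = beta * horizon
naive = summed_action_slopes([actions], outcome, horizon)       # u is unobserved
oracle = summed_action_slopes([actions, u], outcome, horizon)   # u is adjusted for
```

Here `naive` overstates `true_effect` because the hidden factor is absorbed into the action slopes, while `oracle` recovers it. A fuller test would replace the regression with the estimator under study and check whether its bias grows with `hidden_strength`.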

Figures

Figures reproduced from arXiv: 2604.05639 by I-han Lai, Stefan Wager.

**Figure 2.** Sampling distributions of the four MPE estimators across replications in the dynamic [...]

Original abstract

We develop methods for estimating how infinitesimal policy changes affect long-term outcomes in dynamic systems. We show that dynamic marginal policy effects (MPEs) can be identified via tractable reduced-form expressions, and can be estimated under a general sequential unconfoundedness assumption. We also propose a doubly robust estimator for dynamic MPEs. Our approach does not require observing full dynamic state information (as is typically assumed for off-policy evaluation in Markov decision processes), and does not incur an exponential curse of horizon (as is typical in non-Markovian off-policy evaluation). We demonstrate practicality and robustness of our approach in a number of simulations, including one motivated by a dynamic pricing application where people use past prices to form a reference level for current prices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper gives reduced-form identification for dynamic marginal policy effects plus a doubly robust estimator that works without full states or the usual horizon explosion.

The main thing here is a reduced-form way to identify dynamic marginal policy effects in sequential settings, along with a doubly robust estimator under sequential unconfoundedness that skips the need for complete dynamic state information and avoids the exponential curse of horizon. They express the effects as expectations of sums of conditional terms based on observed histories, which keeps the formulas tractable on paper. The dynamic pricing simulation, where agents form reference levels from past prices, tests this in a non-Markovian case and shows the estimator holds up in finite samples with some robustness checks. The doubly robust property is a clear practical plus, since it protects against misspecification in the nuisance models for the conditionals or propensities. This extends standard off-policy tools by relaxing the full-state MDP requirement while staying within sequential unconfoundedness.

On the soft side, the stress-test point about growing histories has some bite. Even with reduced-form expressions, nonparametric estimation of history-dependent conditionals can still run into practical dimensionality trouble as time lengthens and past actions accumulate, especially if no low-dimensional sufficient statistic is assumed. The simulations use moderate horizons and structured features, so they do not fully rule out slower rates or instability in longer or higher-dimensional sequences. The paper would benefit from more explicit discussion of how the estimator scales or any regularization used for those terms.

This is aimed at causal inference and econometrics readers who work on sequential policies, such as dynamic pricing, treatment regimes, or longitudinal data. Someone already familiar with off-policy evaluation but needing tools for partial observation and marginal effects would get direct value. The thinking is clear and the claims line up with the simulations and assumptions stated. It deserves a serious referee because the identification strategy is distinct enough and the estimator is implementable.

Referee Report

1 major / 2 minor

Summary. The manuscript develops methods for identifying and estimating dynamic marginal policy effects (MPEs) in sequential systems. It derives tractable reduced-form expressions for these effects under sequential unconfoundedness and proposes a doubly robust estimator. The approach is claimed not to require full dynamic state information and to avoid the exponential curse of horizon typical in non-Markovian off-policy evaluation. Practicality is shown through simulations, including a dynamic pricing example where agents condition on past prices to form reference levels.

Significance. If the central results hold, the work would offer a useful advance for causal inference in dynamic, possibly non-Markovian environments by enabling estimation of long-term effects of infinitesimal policy changes with reduced data requirements and without the usual dimensionality explosion. The doubly robust estimator and the dynamic pricing simulation provide concrete support for applicability in settings like economics and sequential decision-making.

major comments (1)

[§3] §3 (Identification): The reduced-form expression for the dynamic MPE is written as an expectation of summed terms involving conditional expectations of the outcome given the observed history up to each t. In the dynamic pricing simulation, where agents condition on histories of past prices whose dimension grows linearly with t, it is unclear whether the nonparametric estimation of these history-conditioned quantities avoids the curse of dimensionality; this point is load-bearing for the claim that the method sidesteps the exponential curse of horizon in non-Markovian settings.

minor comments (2)

[Abstract] Abstract: The statement that the estimator 'does not require observing full dynamic state information' would benefit from a one-sentence clarification of what minimal history is actually used.
[Simulations] Simulation section: Report the effective sample size and bandwidth choices used for the conditional expectations in the dynamic pricing example to allow readers to assess finite-sample behavior.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive feedback on our paper. We have carefully considered the major comment and provide our response below. We will make revisions to the manuscript to address the concerns raised regarding the estimation in high-dimensional history settings.

Referee: [§3] §3 (Identification): The reduced-form expression for the dynamic MPE is written as an expectation of summed terms involving conditional expectations of the outcome given the observed history up to each t. In the dynamic pricing simulation, where agents condition on histories of past prices whose dimension grows linearly with t, it is unclear whether the nonparametric estimation of these history-conditioned quantities avoids the curse of dimensionality; this point is load-bearing for the claim that the method sidesteps the exponential curse of horizon in non-Markovian settings.

Authors: We appreciate the referee pointing out this subtlety. The reduced-form identification expresses the dynamic MPE as an expectation of summed terms, each involving a conditional expectation of the outcome given the observed history up to time t. This allows identification without requiring the full dynamic state or Markovian assumptions. Our claim to sidestep the exponential curse of horizon pertains to avoiding the accumulation of importance sampling weights over long trajectories, which typically causes exponential growth in variance with the horizon length. Our doubly robust estimator instead permits estimation of each time-specific term separately. We concur that fully nonparametric estimation of conditional expectations given histories whose dimension grows with t will be subject to the curse of dimensionality. The simulation uses histories of limited length where such estimation remains feasible, and we will add discussion in the revised manuscript clarifying the scope of our claims and the practical considerations for estimation in growing history dimensions.

revision: partial
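The variance growth that the response attributes to accumulated importance weights can be seen in a small closed-form calculation. The sketch below assumes i.i.d. binary actions with a fixed behavior probability and a fixed target probability, which is a simplification chosen for illustration; the function name `weight_second_moment` and the probabilities 0.5 and 0.6 are hypothetical.

```python
def weight_second_moment(p_behavior, p_target, horizon):
    """E[W_T^2] for the cumulative importance weight W_T, the product over
    T steps of (target probability / behavior probability) of the action
    taken, with i.i.d. binary actions drawn from the behavior policy.
    Since E[W_T] = 1, the variance of W_T is this value minus 1."""
    per_step = p_target**2 / p_behavior + (1 - p_target)**2 / (1 - p_behavior)
    return per_step**horizon

for T in (1, 10, 50, 100):
    print(T, weight_second_moment(0.5, 0.6, T))
```

Even with a modest per-step mismatch between the two policies, the second moment compounds geometrically in the horizon. This is the sense of "curse of horizon" the response refers to, and it is separate from the dimensionality of the history that the referee raises.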

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper identifies dynamic marginal policy effects via reduced-form expressions under the external sequential unconfoundedness assumption, proposes a doubly robust estimator, and demonstrates it in simulations without any step that reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction. The identification formula is presented as tractable by design and does not invoke prior author work to forbid alternatives or smuggle an ansatz; the central claim remains independent of the paper's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the sequential unconfoundedness assumption for identification; no free parameters, invented entities, or additional axioms are mentioned in the abstract.

axioms (1)

domain assumption: Sequential unconfoundedness holds for the dynamic system.
Stated as the basis for identification of dynamic MPEs.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We show that dynamic marginal policy effects (MPEs) can be identified via tractable reduced-form expressions... doubly robust estimator... does not incur an exponential curse of horizon"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "generalized policy-gradient theorem... representation $\Theta = \sum_t \gamma^{t-1}\, \mathbb{E}\left[\int q_t(s,a)\, \pi'_t(a \mid s)\, d\lambda(a)\right]$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
