arXiv preprint arXiv:1911.06854
3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
Citing papers:
- Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random
  Identifies full-data conditional mean rewards under MNAR missingness via shadow variables and a bridge function, then builds a consistent FQE-style OPE estimator for missingness-aware policies.
- Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents
  ADWM learns a latent diffusion world model with per-transition independent denoising and policy-conditioned guidance to enable accurate offline evaluation of LLM agent policies.
- MedGym: A Unified Continuous-Time Benchmark for Dynamic Medical Treatment Reinforcement Learning
  MedGym introduces a continuous-time RL benchmark for medical treatment derived from clinical data via PINNs, supporting offline/online evaluation on personalization, safety, and discrete vs continuous methods.