The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
Journal of Machine Learning Research , volume=
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5verdicts
UNVERDICTED 5roles
method 1polarities
use method 1representative citing papers
Introduces the Agent State-Markov Policy Gradient (ASMPG) algorithm and a policy gradient theorem for non-Markovian decision processes by jointly optimizing agent state dynamics and control policy.
Offline RL policies trained solely on DIII-D historical data were deployed on the tokamak and produced promising real-world control of the plasma rotation profile.
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
Higher-resolution observations with global-average-pooling encoders improve RL performance and generalization by enabling more localized visual attention, yielding up to 28% gains over standard Impala encoders.
citing papers explorer
-
Policy Gradient Methods for Non-Markovian Reinforcement Learning
Introduces the Agent State-Markov Policy Gradient (ASMPG) algorithm and a policy gradient theorem for non-Markovian decision processes by jointly optimizing agent state dynamics and control policy.