SOPE dynamically controls offline training length in online RL using actor-aligned OPE on validation data to stop when benefits saturate, achieving up to 45.6% better performance and 22x less computation on Minari tasks.
Dropout: A simple way to prevent neural networks from overfitting.Journal of Machine Learning Research, 15(56):1929–1958
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
SOPE dynamically controls offline training length in online RL using actor-aligned OPE on validation data to stop when benefits saturate, achieving up to 45.6% better performance and 22x less computation on Minari tasks.