VIPO improves model-based offline RL by minimizing value function inconsistency between direct data estimates and model predictions, achieving SOTA results on D4RL and NeoRL benchmarks.
In this work, we achieve conservatism by incorporating the value function inconsistency loss, enabling the training of a more reliable model
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning
VIPO improves model-based offline RL by minimizing value function inconsistency between direct data estimates and model predictions, achieving SOTA results on D4RL and NeoRL benchmarks.