Offline Reinforcement Learning with Universal Horizon Models
Pith reviewed 2026-05-20 19:40 UTC · model grok-4.3
The pith
Universal horizon models directly predict future states at arbitrary times to enable stable offline reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Universal horizon models (UHM) directly predict the future state for any finite horizon h. Combined with a winsorized horizon distribution that caps large horizons, this enables scalable value learning from imagined trajectories. The approach outperforms competitive baselines on 100 OGBench tasks, with particular improvements on highly suboptimal datasets and tasks needing long-horizon reasoning.
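A minimal sketch of what a winsorized horizon distribution could look like, assuming the uncapped horizon is geometric with parameter 1 - gamma (the discounted-future view that GHM takes) and that winsorizing moves all mass beyond a cap onto the cap itself. The names `sample_winsorized_horizon`, `gamma`, and `h_max`, and the values used, are illustrative assumptions; this page does not state the paper's actual threshold or sampling procedure.

```python
import random


def sample_winsorized_horizon(gamma, h_max, rng):
    """Draw h ~ Geometric(1 - gamma) on {1, 2, ...}, then cap it at h_max."""
    h = 1
    # Each additional step is taken with probability gamma (geometric tail);
    # stopping at h_max puts all of the tail mass on the cap.
    while h < h_max and rng.random() < gamma:
        h += 1
    return h


rng = random.Random(0)
gamma, h_max = 0.99, 50  # illustrative values, not taken from the paper
horizons = [sample_winsorized_horizon(gamma, h_max, rng) for _ in range(20000)]

# Fraction of draws that landed exactly on the cap.
cap_mass = sum(h == h_max for h in horizons) / len(horizons)
print(f"min={min(horizons)} max={max(horizons)} cap_mass={cap_mass:.3f}")
```

Here `cap_mass` estimates how much probability the cap absorbs; a smaller `h_max` absorbs more of the tail.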
What carries the argument
Universal horizon models that directly predict states under arbitrary horizons, paired with a winsorized horizon distribution for training stability.
If this is right
- Scalable value learning becomes possible without repeated model rollouts that accumulate errors.
- Training remains stable even when considering long but finite horizons through the capping mechanism.
- Superior performance emerges on suboptimal data and long-horizon tasks compared to prior model-based methods.
- The generalization from geometric horizon models allows more flexible planning depths.
Where Pith is reading between the lines
- This technique might apply to settings beyond offline RL, such as planning in partially observable environments.
- The winsorized distribution could be adapted dynamically based on task difficulty to further improve results.
- If direct prediction stays accurate at very large but capped horizons, the open question becomes how to trade horizon length against prediction accuracy.
Load-bearing premise
Predicting states directly at arbitrary horizons reduces compounding errors more than repeated short predictions or fixed infinite-horizon models, without adding significant new errors for distant states.
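The compounding half of this premise can be illustrated with a toy calculation. The sketch below assumes scalar linear dynamics and a one-step model whose output is off by a fixed multiplicative factor (1 + eps); both the setup and the name `rollout_relative_error` are hypothetical. It says nothing about whether a direct h-step model keeps its own error flat, which is the part of the premise that remains open.

```python
def rollout_relative_error(eps, h):
    """Relative error after composing a one-step model h times,
    when every step is off by a multiplicative factor (1 + eps)."""
    return (1.0 + eps) ** h - 1.0


eps = 0.01  # hypothetical 1% per-step model error
for h in (1, 10, 50, 100):
    print(h, rollout_relative_error(eps, h))
```

In this toy setting the error of repeated rollouts grows geometrically with h, so a direct predictor only helps if its error at horizon h grows more slowly than that.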
What would settle it
Two observations would count against the universal horizon model approach: prediction error that rises at large horizons, or no performance gain over baselines on the OGBench tasks with suboptimal data.
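One way such a check might be run is to measure prediction error separately for each horizon. The sketch below assumes a point-prediction interface `predict(s, a, h)` (a hypothetical stand-in for a learned horizon-conditioned model, which in the paper is a distribution) and uses a synthetic one-dimensional trajectory; none of the numbers come from the paper.

```python
from collections import defaultdict


def mse_by_horizon(states, actions, predict, horizons):
    """Mean squared error of predict(s_t, a_t, h) against states[t + h]."""
    sq_err, count = defaultdict(float), defaultdict(int)
    for h in horizons:
        for t in range(len(states) - h):
            pred = predict(states[t], actions[t], h)
            target = states[t + h]
            # Average the squared error over state dimensions.
            sq_err[h] += sum((p - y) ** 2 for p, y in zip(pred, target)) / len(target)
            count[h] += 1
    return {h: sq_err[h] / count[h] for h in horizons if count[h] > 0}


# Toy data: a 1-D state that decays by 0.9 per step; actions are placeholders.
states = [[0.9 ** t] for t in range(60)]
actions = [[0.0] for _ in states]


def biased_model(s, a, h):
    # Hypothetical imperfect model: uses decay 0.88 instead of the true 0.9.
    return [s[0] * 0.88 ** h]


errors = mse_by_horizon(states, actions, biased_model, horizons=[1, 5, 20])
print(errors)
```

Computing the same dictionary for a direct model and for a rolled-out one-step model, then comparing them per horizon, is the kind of measurement that would address the question.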
Original abstract
Model-based reinforcement learning (RL) offers a compelling approach to offline RL by enabling value learning on imagined on-policy trajectories. However, it often suffers from compounding errors due to repeated model inference on self-generated states. While geometric horizon models (GHM) alleviate this issue through direct prediction over a discounted infinite-horizon future, they remain challenged in accurately modeling distant future states. To this end, we introduce universal horizon models (UHM), a generalization of GHM that directly predicts future states under arbitrary horizons. Leveraging this flexibility, we propose a scalable value learning method that employs a winsorized horizon distribution to stabilize training by capping excessively large horizons. Experimental results on 100 challenging OGBench tasks demonstrate that the proposed method outperforms competitive baselines, particularly on tasks with highly suboptimal datasets and those requiring long-horizon reasoning. Project page: https://rllab-snu.github.io/projects/UHM/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes universal horizon models (UHM) as a generalization of geometric horizon models (GHM) for model-based offline RL. UHM directly predicts future states conditioned on arbitrary finite horizons h, and the authors combine this with a winsorized horizon distribution to enable stable value learning via imagined trajectories. They report that the resulting method outperforms competitive baselines on 100 OGBench tasks, with particular gains on highly suboptimal datasets and tasks requiring long-horizon reasoning.
Significance. If the central empirical claim holds after addressing the modeling-error concerns below, the work would offer a concrete advance in scalable model-based offline RL by relaxing the infinite-horizon assumption of GHM while avoiding the compounding-error accumulation of repeated one-step rollouts. The reported gains on long-horizon and suboptimal tasks would be a useful data point for the community, especially if accompanied by reproducible code or explicit falsifiable predictions about horizon-dependent prediction error.
major comments (2)
- [§3] The motivation correctly identifies GHM's difficulty with distant states, yet the central claim that UHM plus winsorization yields more accurate imagined trajectories rests on the untested premise that the learned conditional p(s_{t+h} | s_t, a_t, h) does not incur higher error at large h that offsets the reduction in compounding error. No direct measurement of state-prediction MSE or value-estimate bias as a function of h is provided to substantiate this.
- [Experimental results] The abstract and results section claim outperformance on 100 OGBench tasks without reporting per-task error bars, statistical significance tests, or ablations that isolate the contribution of the winsorized horizon distribution versus the UHM architecture itself. This makes it difficult to assess whether the gains are robust or sensitive to post-hoc hyperparameter choices.
minor comments (2)
- [Method] Notation: The definition of the winsorized horizon distribution should be stated explicitly with the capping threshold and sampling procedure, preferably as an equation.
- The project page link is useful; including a short description of the released code and checkpoints would aid reproducibility.
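As an illustration of the explicit statement the first minor comment asks for, one possible form is sketched below. It assumes the uncapped horizon is geometric with parameter 1 - \gamma and that the cap H absorbs the entire tail; both \gamma and H are hypothetical symbols here, and the paper's actual definition may differ.

```latex
h = \min(\tilde{h}, H), \qquad \tilde{h} \sim \mathrm{Geom}(1-\gamma),
\qquad
P(h = k) =
\begin{cases}
(1-\gamma)\,\gamma^{k-1}, & 1 \le k < H, \\
\gamma^{H-1}, & k = H .
\end{cases}
```

The two cases sum to one, since \sum_{k=1}^{H-1} (1-\gamma)\gamma^{k-1} = 1 - \gamma^{H-1}.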
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas for strengthening the empirical support and presentation of our results. We address each major comment below and commit to the corresponding revisions in the updated manuscript.
Point-by-point responses
- Referee: [§3] The motivation correctly identifies GHM's difficulty with distant states, yet the central claim that UHM plus winsorization yields more accurate imagined trajectories rests on the untested premise that the learned conditional p(s_{t+h} | s_t, a_t, h) does not incur higher error at large h that offsets the reduction in compounding error. No direct measurement of state-prediction MSE or value-estimate bias as a function of h is provided to substantiate this.
  Authors: We agree that direct empirical measurements of prediction error versus horizon would provide stronger substantiation for the central modeling claim. In the revised manuscript we will add new figures in Section 3 (or a dedicated appendix) that report state-prediction MSE and value-estimate bias as explicit functions of h for both UHM and GHM on representative OGBench tasks. These plots will quantify whether error growth at large h remains modest enough to preserve the benefit of reduced compounding error.
  Revision: yes
- Referee: [Experimental results] The abstract and results section claim outperformance on 100 OGBench tasks without reporting per-task error bars, statistical significance tests, or ablations that isolate the contribution of the winsorized horizon distribution versus the UHM architecture itself. This makes it difficult to assess whether the gains are robust or sensitive to post-hoc hyperparameter choices.
  Authors: We acknowledge that the current experimental reporting lacks the statistical detail and targeted ablations needed for full assessment. In the revision we will (i) report per-task normalized scores together with standard deviations over at least five random seeds, (ii) include aggregate statistical significance tests (e.g., Wilcoxon signed-rank test across the 100 tasks), and (iii) add ablation tables that separately disable the winsorized horizon distribution while keeping the UHM architecture fixed, and vice versa. These additions will appear in the main results section and an expanded appendix.
  Revision: yes
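The aggregate significance test mentioned in the second response could be sketched as follows. This is a plain-Python Wilcoxon signed-rank test using the normal approximation with a tie correction and no continuity correction; the function name and the synthetic scores are assumptions for illustration only and are not results from the paper.

```python
import math
import random


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Zero differences are dropped; tied |differences| share average ranks.
    Returns (W_plus, z, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    var = n * (n + 1) * (2 * n + 1) / 24.0 - tie_term / 48.0
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return w_plus, z, p_value


# Synthetic per-task scores for 100 tasks (placeholders, not paper results).
rng = random.Random(0)
baseline_scores = [rng.uniform(20.0, 80.0) for _ in range(100)]
method_scores = [b + rng.gauss(3.0, 5.0) for b in baseline_scores]

w_plus, z, p_value = wilcoxon_signed_rank(method_scores, baseline_scores)
print(f"W+ = {w_plus:.1f}, z = {z:.2f}, p = {p_value:.2g}")
```

A paired test of this kind treats each task as one observation, so it addresses consistency across the 100 tasks rather than seed-to-seed variance within a task; the per-task standard deviations promised in item (i) cover the latter.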
Circularity Check
No significant circularity; derivation is self-contained
The paper defines UHM as a direct generalization of GHM to support arbitrary finite horizons instead of infinite discounted prediction, then introduces a winsorized horizon distribution as a training stabilization choice. Neither step reduces to self-definition, a fitted parameter renamed as prediction, or a self-citation chain; the model p(s_{t+h} | s_t, a_t, h) and the value-learning procedure are specified independently of the claimed performance gains. Empirical results on the external 100-task OGBench benchmark supply independent evidence rather than tautological confirmation. No load-bearing equation or premise collapses to its own inputs by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: We introduce universal horizon models (UHM), a generalization of GHM that directly predicts future states under arbitrary horizons... winsorized horizon distribution
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.