Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data
Pith reviewed 2026-05-09 14:22 UTC · model grok-4.3
The pith
By using large language models to define costs for unsafe states and a learned dynamics model to simulate future violations, PROCO enables learning of safe policies from offline data that has few or no examples of constraint violations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PROCO learns a dynamics model from the given offline data, builds a conservative cost function by grounding natural-language descriptions of unsafe states through large language models, and then runs model-based rollouts to create diverse counterfactual unsafe samples. These samples support reliable detection of safe-but-infeasible states and guide feasibility-aware policy optimization, yielding policies with fewer constraint violations than both the original offline safe RL methods and behavior-cloning baselines.
What carries the argument
The proactive cost generation process: an LLM-grounded conservative cost function combined with learned dynamics model rollouts that synthesize counterfactual unsafe samples for feasibility identification.
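The pattern is compact enough to sketch. Everything below is illustrative rather than the paper's implementation: the hazard layout, threshold, and model interfaces are assumptions, and in PROCO the cost rule would come from LLM grounding of a natural-language unsafe-state description rather than being hand-written.

```python
import numpy as np

def llm_grounded_cost(state, hazard_thresh=0.9):
    """Hypothetical conservative cost: flag the state unsafe (cost 1) when
    any of the last 16 dims (assumed hazard readings) crosses a threshold."""
    return float(np.any(state[-16:] >= hazard_thresh))

def synthesize_counterfactuals(dynamics, policy, start_states, horizon=10):
    """Roll the learned dynamics model forward from safe start states and
    collect every state the conservative cost flags as unsafe -- these are
    the counterfactual unsafe samples absent from the offline dataset."""
    unsafe = []
    for s in start_states:
        for _ in range(horizon):
            s = dynamics(s, policy(s))  # one-step model prediction
            if llm_grounded_cost(s) > 0:
                unsafe.append(np.array(s, copy=True))
    return unsafe
```

The synthesized samples then play the role that observed violations would play in a conventional cost-critic pipeline.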
If this is right
- Existing offline safe RL algorithms integrate directly with PROCO and show lower violation rates on the same limited-violation datasets.
- Policies trained this way avoid states that satisfy constraints at the current step but lead to violations within a few steps.
- The method outperforms pure behavior cloning baselines in both safety and task performance across tested environments.
- It supports learning from datasets that contain exclusively safe trajectories without requiring any observed violations.
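The second bullet reduces to a short rollout test: a state is safe-but-infeasible when it satisfies the constraint now but every sampled model rollout violates it within a few steps. The sketch below is a hypothetical rendering with a deterministic toy model, not the paper's procedure.

```python
def is_safe_but_infeasible(state, dynamics, policy, cost_fn, k=5, n_samples=8):
    """Hypothetical feasibility check: the state satisfies the constraint
    now (cost 0), yet every sampled k-step model rollout hits a violation."""
    if cost_fn(state) > 0:
        return False                  # already unsafe, not "safe-but-"
    for _ in range(n_samples):
        s, violated = state, False
        for _ in range(k):
            s = dynamics(s, policy(s))
            if cost_fn(s) > 0:
                violated = True
                break
        if not violated:
            return False              # at least one escape route exists
    return True
```

With a stochastic model or policy the `n_samples` rollouts would differ; here they are identical, which is fine for illustrating the logic.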
Where Pith is reading between the lines
- The same grounding-plus-rollout pattern could apply in other domains where textual safety rules exist but violation examples are absent from data.
- Improving the accuracy of the dynamics model or the LLM grounding step would likely amplify the safety gains without changing the core structure.
- This suggests hybrid systems that mix learned models with external knowledge sources may reduce reliance on dangerous data collection in real-world robotics and control.
Load-bearing premise
The dynamics model learned from offline data must be accurate enough to produce reliable multi-step counterfactual trajectories that actually predict future violations.
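One standard proxy for checking that premise, sketched here with hypothetical names (the paper may quantify model uncertainty differently), is disagreement among an ensemble of learned dynamics models: large spread on a rollout step suggests extrapolation beyond the offline data.

```python
import numpy as np

def one_step_disagreement(models, state, action):
    """Spread of an ensemble's one-step predictions. A large spread
    suggests the rollout has left the data distribution, so counterfactual
    violations it predicts deserve less trust."""
    preds = np.stack([m(state, action) for m in models])
    return float(np.max(np.std(preds, axis=0)))
```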
What would settle it
On a Safety-Gymnasium task with clean data, replace the learned dynamics model with the true simulator: if PROCO then shows no reduction in constraint violations over the baseline method, the cost generation step does not help.
Original abstract
Learning constraint-satisfying policies from offline data without risky online interaction is crucial for safety-critical decision making. Conventional methods typically learn cost value functions from abundant unsafe samples to define safety boundaries and penalize violations. However, in high-stakes scenarios, risky trial-and-error is infeasible, yielding datasets with few or no unsafe samples. Under this limitation, existing approaches often treat all data as uniformly safe, overlooking safe-but-infeasible states - states that currently satisfy constraints but inevitably violate them within a few steps - leading to deployment failures. Drawing inspiration from the concept of knowledge-data integration, we leverage large language models (LLMs) to incorporate natural language knowledge into the policy to address this challenge. Specifically, we propose PROCO, a model-based offline safe reinforcement learning (RL) framework tailored to datasets largely free of violations. PROCO first learns a dynamics model from offline data and constructs a conservative cost function by grounding natural-language knowledge of unsafe states in LLMs, enabling risk estimation even without observed violations. Using the cost function and learned model, PROCO performs model-based rollouts to synthesize diverse counterfactual unsafe samples, supporting reliable feasibility identification and feasibility-guided policy learning. Across a range of Safety-Gymnasium tasks with exclusively safe or minimally risky training data, PROCO integrates seamlessly with a variety of offline safe RL algorithms and consistently demonstrates reduced constraint violations and improved safety performance compared to both the original methods and other behavior cloning baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PROCO, a model-based offline safe RL method for learning constraint-satisfying policies from datasets containing exclusively safe or minimally risky trajectories. It learns a dynamics model from the offline data, grounds a conservative cost function in natural-language knowledge from LLMs to estimate risks without observed violations, performs model-based rollouts to synthesize counterfactual unsafe samples, and uses these to enable reliable feasibility identification and feasibility-guided policy learning. The approach integrates with existing offline safe RL algorithms and is evaluated on Safety-Gymnasium tasks, claiming reduced constraint violations and improved safety performance over baselines.
Significance. If the empirical claims hold under rigorous validation of the counterfactual samples, the work would address a key practical gap in offline safe RL: handling high-stakes settings where unsafe data cannot be collected. It combines LLM knowledge with model-based synthesis in a way that could extend to other domains with sparse violation data, provided the extrapolation issues are resolved.
major comments (2)
- [Method] Method section (dynamics model and rollout procedure): The central claim depends on the learned dynamics model producing reliable counterfactual unsafe trajectories for feasibility identification. Because the model is trained exclusively on safe or minimally risky data, rollouts toward LLM-identified high-risk states necessarily involve extrapolation. The manuscript provides no error analysis, uncertainty quantification, or ablation on rollout accuracy (e.g., comparison of predicted vs. true violations in held-out unsafe states), which is load-bearing for the reported reductions in constraint violations.
- [Experiments] Experiments section (Safety-Gymnasium results): The abstract and claims assert consistent improvements across tasks with 'exclusively safe or minimally risky training data,' yet no details are given on how the offline datasets were constructed to ensure zero or near-zero violations, nor on statistical significance of the violation reductions versus baselines. This makes it impossible to assess whether the gains stem from the proposed LLM-grounded costs and counterfactuals or from other factors.
minor comments (2)
- [Abstract] The abstract states empirical improvements but omits any mention of baselines, number of seeds, or statistical tests; these details should be summarized even in the abstract for clarity.
- [Method] Notation for the conservative cost function and feasibility identification should be introduced with explicit equations rather than prose descriptions to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below and commit to revisions that strengthen the validation of the dynamics model and experimental details without altering the core claims.
Point-by-point responses
Referee: [Method] Method section (dynamics model and rollout procedure): The central claim depends on the learned dynamics model producing reliable counterfactual unsafe trajectories for feasibility identification. Because the model is trained exclusively on safe or minimally risky data, rollouts toward LLM-identified high-risk states necessarily involve extrapolation. The manuscript provides no error analysis, uncertainty quantification, or ablation on rollout accuracy (e.g., comparison of predicted vs. true violations in held-out unsafe states), which is load-bearing for the reported reductions in constraint violations.
Authors: We agree that the dynamics model, trained only on safe data, requires explicit validation when performing extrapolative rollouts to LLM-identified risky states. The conservative LLM-grounded cost function is intended to mitigate over-optimism by providing an independent risk signal, but this does not replace the need for rollout accuracy checks. In the revised manuscript we will add an error analysis subsection that reports prediction errors on simulated unsafe trajectories (generated via controlled perturbations of safe data), ensemble-based uncertainty estimates for the dynamics model, and an ablation comparing feasibility identification with and without rollout uncertainty thresholding. revision: yes
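The promised ablation ("with and without rollout uncertainty thresholding") amounts to a filter like the following; the names and the choice of threshold are placeholders, not the authors' code.

```python
def filter_by_uncertainty(samples, disagreement_fn, tau):
    """Keep only synthesized unsafe samples whose model disagreement stays
    below tau; setting tau to infinity recovers the unthresholded variant
    that serves as the ablation baseline."""
    kept = [s for s in samples if disagreement_fn(s) <= tau]
    return kept, len(samples) - len(kept)
```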
Referee: [Experiments] Experiments section (Safety-Gymnasium results): The abstract and claims assert consistent improvements across tasks with 'exclusively safe or minimally risky training data,' yet no details are given on how the offline datasets were constructed to ensure zero or near-zero violations, nor on statistical significance of the violation reductions versus baselines. This makes it impossible to assess whether the gains stem from the proposed LLM-grounded costs and counterfactuals or from other factors.
Authors: The referee correctly notes that additional transparency is required. The datasets were generated by rolling out near-optimal safe policies (with violation rates below 1% per episode) in Safety-Gymnasium environments; we will expand the experimental setup section with the precise data-collection procedure, per-task violation counts in the offline data, and the exact number of trajectories. We will also report statistical significance (paired t-tests with p-values and confidence intervals) for all violation-reduction results versus baselines to confirm that observed gains are attributable to the LLM-grounded costs and counterfactual synthesis rather than implementation differences. revision: yes
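The promised significance test is mechanical; the sketch below uses a normal approximation for the p-value and made-up per-seed violation counts, purely for illustration.

```python
import math
from statistics import mean, stdev

def paired_t(baseline, method):
    """Paired t-statistic and two-sided p-value (normal approximation; a
    real report would use the t distribution with n - 1 dof) for per-seed
    violation counts of a baseline vs. the proposed method."""
    diffs = [b - m for b, m in zip(baseline, method)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))
    return t, p
```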
Circularity Check
No circularity: derivation uses external LLM grounding and standard model-based synthesis independent of fitted inputs
Full rationale
The paper's claimed chain proceeds as: (1) fit dynamics model to offline safe trajectories, (2) construct conservative cost via LLM grounding of natural-language unsafe-state descriptions (external to data), (3) roll out the model from safe states toward LLM-identified high-risk regions to synthesize counterfactual unsafe samples, (4) use those samples for feasibility identification and policy improvement. None of these steps reduces by construction to its own inputs; the cost function is not fitted to the target violations or policy performance, the rollouts are genuine forward simulation (even if extrapolation quality is debatable on correctness grounds), and no self-citation or uniqueness theorem is invoked to force the architecture. The method therefore remains self-contained against the Safety-Gymnasium benchmarks and does not exhibit any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The dynamics model learned from offline data accurately predicts future states for rollouts.
- ad hoc to paper: LLMs can provide reliable natural-language knowledge about unsafe states in the specific domain.
invented entities (1)
- Conservative cost function grounded in LLMs: no independent evidence
LLM-generated cost function (recovered fragment)
The snippet flags a state unsafe when any hazard reading reaches 0.90 or any pillar reading reaches 0.85. The function name and the final return logic are not present in the fragment and are reconstructed here from its own docstring:

```python
import numpy as np

def cost_function(observation):
    """Return 0 if safe, 1 if unsafe.

    Any hazard reading r >= 0.90 -> unsafe
    Any pillar reading r >= 0.85 -> unsafe
    """
    obs = np.asarray(observation)
    if obs.ndim != 1 or obs.size < 48:
        raise ValueError("Observation must be a 1D array with length >= 48.")
    # extract the 16-dim hazard and pillar readings
    hazard_r = obs[-48:-32]
    pillar_r = obs[-32:-16]
    # thresholds
    HAZARD_THRESH = 0.90
    PILLAR_THRESH = 0.85
    # remainder reconstructed from the docstring above
    if np.any(hazard_r >= HAZARD_THRESH) or np.any(pillar_r >= PILLAR_THRESH):
        return 1
    return 0
```