pith. machine review for the scientific record.

arxiv: 2605.01356 · v1 · submitted 2026-05-02 · 💻 cs.LG · cs.AI

Recognition: unknown

Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data

Authors on Pith no claims yet

Pith reviewed 2026-05-09 14:22 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords safe reinforcement learning · offline RL · large language models · model-based planning · constraint violations · counterfactual data generation · feasibility identification · Safety-Gymnasium

The pith

By using large language models to define costs for unsafe states and a learned dynamics model to simulate future violations, PROCO enables learning of safe policies from offline data that has few or no examples of constraint violations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that standard offline safe reinforcement learning breaks down on datasets with almost no unsafe examples because it cannot detect states that satisfy constraints now but will violate them soon. PROCO fixes this by grounding natural-language knowledge of risks into a conservative cost function via large language models, then using a learned dynamics model to roll out and synthesize many counterfactual unsafe trajectories. These synthetic samples let the method identify infeasible states and train policies that avoid them. The result matters for high-stakes settings where collecting risky data is impossible, allowing safer deployment without online trial-and-error.
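
To make the grounding step concrete: below is a minimal sketch of the kind of rule-based unsafe-state check an LLM could emit from natural-language safety descriptions, reusing the hazard (>= 0.90) and pillar (>= 0.85) thresholds that appear in the code fragment extracted with the paper (entries 64-65 of the reference graph). The function name, observation layout, and example observation are illustrative assumptions, not the paper's exact implementation.

    import numpy as np

    # Thresholds follow the code fragment extracted with the paper (reference-graph
    # entries 64-65); the observation layout is an assumption for illustration.
    HAZARD_THRESH = 0.90
    PILLAR_THRESH = 0.85

    def conservative_cost(observation) -> int:
        """Return 1 if the state is treated as unsafe, 0 otherwise."""
        obs = np.asarray(observation, dtype=float)
        if obs.ndim != 1 or obs.size < 48:
            raise ValueError("Observation must be a 1D array with length >= 48.")
        hazard_readings = obs[-48:-32]   # 16-dim hazard readings (assumed layout)
        pillar_readings = obs[-32:-16]   # 16-dim pillar readings (assumed layout)
        unsafe = (hazard_readings >= HAZARD_THRESH).any() or (pillar_readings >= PILLAR_THRESH).any()
        return int(unsafe)

    # Example: a synthetic observation in which one hazard reading crosses the threshold.
    obs = np.zeros(60)
    obs[-40] = 0.95                      # index -40 lies inside the hazard slice obs[-48:-32]
    print(conservative_cost(obs))        # -> 1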

Core claim

PROCO learns a dynamics model from the given offline data, builds a conservative cost function by grounding natural-language descriptions of unsafe states through large language models, and then runs model-based rollouts to create diverse counterfactual unsafe samples. These samples support reliable detection of safe-but-infeasible states and guide feasibility-aware policy optimization, yielding policies with fewer constraint violations than both the original offline safe RL methods and behavior-cloning baselines.

What carries the argument

The proactive cost generation process: an LLM-grounded conservative cost function combined with learned dynamics model rollouts that synthesize counterfactual unsafe samples for feasibility identification.
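
A minimal sketch of this synthesis loop, under stated assumptions: the dynamics model below is a toy damped linear map standing in for the learned model, the cost rule is a placeholder threshold rather than an LLM-grounded one, and the random-action rollout scheme is illustrative, not the paper's procedure.

    import numpy as np

    rng = np.random.default_rng(0)

    def dynamics_model(state, action):
        # Toy stand-in for the learned dynamics model: damped drift plus an action effect.
        return 0.9 * state + 0.3 * action

    def conservative_cost(state) -> int:
        # Placeholder for the LLM-grounded rule: any coordinate beyond 1.0 counts as unsafe.
        return int(np.abs(np.asarray(state)).max() >= 1.0)

    def synthesize_unsafe_rollouts(start_states, horizon=10, rollouts_per_state=8):
        """Roll the model forward from safe offline states and keep trajectories that
        the conservative cost flags, yielding counterfactual unsafe samples."""
        unsafe_trajectories = []
        for s0 in start_states:
            for _ in range(rollouts_per_state):
                state = np.array(s0, dtype=float)
                traj = [state.copy()]
                for _ in range(horizon):
                    action = rng.normal(size=state.shape)    # exploratory rollout actions
                    state = dynamics_model(state, action)
                    traj.append(state.copy())
                if any(conservative_cost(s) for s in traj):  # rollout reaches an unsafe state
                    unsafe_trajectories.append(np.stack(traj))
        return unsafe_trajectories

    safe_starts = [np.zeros(2), np.array([0.3, -0.2])]       # states from the (safe) offline data
    synthetic = synthesize_unsafe_rollouts(safe_starts)
    print(f"{len(synthetic)} counterfactual unsafe rollouts synthesized")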

If this is right

  • Existing offline safe RL algorithms integrate directly with PROCO and show lower violation rates on the same limited-violation datasets.
  • Policies trained this way avoid states that satisfy constraints at the current step but lead to violations within a few steps.
  • The method outperforms pure behavior cloning baselines in both safety and task performance across tested environments.
  • It supports learning from datasets that contain exclusively safe trajectories without requiring any observed violations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same grounding-plus-rollout pattern could apply in other domains where textual safety rules exist but violation examples are absent from data.
  • Improving the accuracy of the dynamics model or the LLM grounding step would likely amplify the safety gains without changing the core structure.
  • This suggests hybrid systems that mix learned models with external knowledge sources may reduce reliance on dangerous data collection in real-world robotics and control.

Load-bearing premise

The dynamics model learned from offline data must be accurate enough to produce reliable multi-step counterfactual trajectories that actually predict future violations.
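
One way to probe this premise is to roll out an ensemble of dynamics models and distrust rollouts on which the members disagree, in the spirit of the ensemble-based uncertainty estimates discussed in the editorial exchange below. In this sketch the ensemble members are toy perturbed linear maps and the disagreement cutoff is an assumed value.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy ensemble: each member is a slightly perturbed linear map standing in for an
    # independently trained dynamics network. The default argument freezes each member's
    # coefficient at creation time.
    ENSEMBLE = [lambda s, a, A=(0.9 + rng.normal(scale=0.02)): A * s + 0.3 * a
                for _ in range(5)]

    DISAGREEMENT_THRESH = 0.05   # assumed cutoff; rollouts above it would not be trusted

    def rollout_disagreement(start_state, actions):
        """Largest per-step standard deviation across ensemble predictions along a rollout."""
        states = [np.array(start_state, dtype=float) for _ in ENSEMBLE]
        worst = 0.0
        for a in actions:
            states = [model(s, a) for model, s in zip(ENSEMBLE, states)]
            worst = max(worst, float(np.std(np.stack(states), axis=0).max()))
        return worst

    actions = rng.normal(size=(10, 2))                 # a 10-step action sequence in a 2-D toy space
    score = rollout_disagreement(np.zeros(2), actions)
    print(f"disagreement={score:.3f}, trusted={score < DISAGREEMENT_THRESH}")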

What would settle it

On a Safety-Gymnasium task with clean data, replace the learned dynamics model with the true simulator: if PROCO then shows no reduction in constraint violations relative to the baseline method, the cost generation step does not help.

Figures

Figures reproduced from arXiv: 2605.01356 by Jing-Wen Yang, Kainuo Cheng, Lei Yuan, Ruiqi Xue, Yang Yu.

Figure 1
Figure 1. Figure 1: Structure of PROCO. (Adjacent definitions from the paper: V^π_h(s) ≤ 0 ⇒ h(s_t) ≤ 0 for all t, i.e., π satisfies the hard constraint starting from s; V^*_h(s) ≤ 0 ⇒ ∃π with V^π_h(s) ≤ 0. The feasible set and largest feasible set are S^π_f := {s | V^π_h(s) ≤ 0} and S^*_f := {s | V^*_h(s) ≤ 0}.) view at source ↗
Figure 2
Figure 2. Figure 2: (a) Visualization of the Circle task. (b) and (c) Visualization of FISOR and PROCO view at source ↗
Figure 3
Figure 3. Figure 3: (a) Analysis on rollout epoch E. (b), (c) Analysis on rollout length H. (d) Analysis on LLM selection. (Adjacent text from the paper: existing SOTA offline safe RL algorithms, namely LSPC, CAPS, and FISOR, perform well with many unsafe samples but fail on safe-only datasets.) view at source ↗
Figure 4
Figure 4. Figure 4: (a) Ablation study results. (b) Feasibility view at source ↗
Figure 5
Figure 5. Figure 5: Performance analysis under different unsafe data percent. view at source ↗
Figure 6
Figure 6. Figure 6: Tasks used in this paper. (a) Navigation Tasks based on Point and Car robots. (b) Velocity view at source ↗
read the original abstract

Learning constraint-satisfying policies from offline data without risky online interaction is crucial for safety-critical decision making. Conventional methods typically learn cost value functions from abundant unsafe samples to define safety boundaries and penalize violations. However, in high-stakes scenarios, risky trial-and-error is infeasible, yielding datasets with few or no unsafe samples. Under this limitation, existing approaches often treat all data as uniformly safe, overlooking safe-but-infeasible states - states that currently satisfy constraints but inevitably violate them within a few steps - leading to deployment failures. Drawing inspiration from the concept of knowledge-data integration, we leverage large language models (LLMs) to incorporate natural language knowledge into the policy to address this challenge. Specifically, we propose PROCO, a model-based offline safe reinforcement learning (RL) framework tailored to datasets largely free of violations. PROCO first learns a dynamics model from offline data and constructs a conservative cost function by grounding natural-language knowledge of unsafe states in LLMs, enabling risk estimation even without observed violations. Using the cost function and learned model, PROCO performs model-based rollouts to synthesize diverse counterfactual unsafe samples, supporting reliable feasibility identification and feasibility-guided policy learning. Across a range of Safety-Gymnasium tasks with exclusively safe or minimally risky training data, PROCO integrates seamlessly with a variety of offline safe RL algorithms and consistently demonstrates reduced constraint violations and improved safety performance compared to both the original methods and other behavior cloning baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PROCO, a model-based offline safe RL method for learning constraint-satisfying policies from datasets containing exclusively safe or minimally risky trajectories. It learns a dynamics model from the offline data, grounds a conservative cost function in natural-language knowledge from LLMs to estimate risks without observed violations, performs model-based rollouts to synthesize counterfactual unsafe samples, and uses these to enable reliable feasibility identification and feasibility-guided policy learning. The approach integrates with existing offline safe RL algorithms and is evaluated on Safety-Gymnasium tasks, claiming reduced constraint violations and improved safety performance over baselines.

Significance. If the empirical claims hold under rigorous validation of the counterfactual samples, the work would address a key practical gap in offline safe RL: handling high-stakes settings where unsafe data cannot be collected. It combines LLM knowledge with model-based synthesis in a way that could extend to other domains with sparse violation data, provided the extrapolation issues are resolved.

major comments (2)
  1. [Method] Method section (dynamics model and rollout procedure): The central claim depends on the learned dynamics model producing reliable counterfactual unsafe trajectories for feasibility identification. Because the model is trained exclusively on safe or minimally risky data, rollouts toward LLM-identified high-risk states necessarily involve extrapolation. The manuscript provides no error analysis, uncertainty quantification, or ablation on rollout accuracy (e.g., comparison of predicted vs. true violations in held-out unsafe states), which is load-bearing for the reported reductions in constraint violations.
  2. [Experiments] Experiments section (Safety-Gymnasium results): The abstract and claims assert consistent improvements across tasks with 'exclusively safe or minimally risky training data,' yet no details are given on how the offline datasets were constructed to ensure zero or near-zero violations, nor on statistical significance of the violation reductions versus baselines. This makes it impossible to assess whether the gains stem from the proposed LLM-grounded costs and counterfactuals or from other factors.
minor comments (2)
  1. [Abstract] The abstract states empirical improvements but omits any mention of baselines, number of seeds, or statistical tests; these details should be summarized even in the abstract for clarity.
  2. [Method] Notation for the conservative cost function and feasibility identification should be introduced with explicit equations rather than prose descriptions to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and commit to revisions that strengthen the validation of the dynamics model and experimental details without altering the core claims.

read point-by-point responses
  1. Referee: [Method] Method section (dynamics model and rollout procedure): The central claim depends on the learned dynamics model producing reliable counterfactual unsafe trajectories for feasibility identification. Because the model is trained exclusively on safe or minimally risky data, rollouts toward LLM-identified high-risk states necessarily involve extrapolation. The manuscript provides no error analysis, uncertainty quantification, or ablation on rollout accuracy (e.g., comparison of predicted vs. true violations in held-out unsafe states), which is load-bearing for the reported reductions in constraint violations.

    Authors: We agree that the dynamics model, trained only on safe data, requires explicit validation when performing extrapolative rollouts to LLM-identified risky states. The conservative LLM-grounded cost function is intended to mitigate over-optimism by providing an independent risk signal, but this does not replace the need for rollout accuracy checks. In the revised manuscript we will add an error analysis subsection that reports prediction errors on simulated unsafe trajectories (generated via controlled perturbations of safe data), ensemble-based uncertainty estimates for the dynamics model, and an ablation comparing feasibility identification with and without rollout uncertainty thresholding. revision: yes

  2. Referee: [Experiments] Experiments section (Safety-Gymnasium results): The abstract and claims assert consistent improvements across tasks with 'exclusively safe or minimally risky training data,' yet no details are given on how the offline datasets were constructed to ensure zero or near-zero violations, nor on statistical significance of the violation reductions versus baselines. This makes it impossible to assess whether the gains stem from the proposed LLM-grounded costs and counterfactuals or from other factors.

    Authors: The referee correctly notes that additional transparency is required. The datasets were generated by rolling out near-optimal safe policies (with violation rates below 1% per episode) in Safety-Gymnasium environments; we will expand the experimental setup section with the precise data-collection procedure, per-task violation counts in the offline data, and the exact number of trajectories. We will also report statistical significance (paired t-tests with p-values and confidence intervals) for all violation-reduction results versus baselines to confirm that observed gains are attributable to the LLM-grounded costs and counterfactual synthesis rather than implementation differences. revision: yes
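
As a point of reference for the committed reporting, a minimal sketch of a paired t-test with a confidence interval over per-seed violation counts; the arrays below are placeholder numbers, not results from the paper.

    import numpy as np
    from scipy import stats

    # Placeholder per-seed constraint-violation counts (illustrative numbers, not paper results).
    baseline_violations = np.array([14.0, 11.0, 17.0, 13.0, 15.0])
    proco_violations = np.array([6.0, 4.0, 9.0, 5.0, 7.0])

    # Paired t-test across seeds.
    t_stat, p_value = stats.ttest_rel(baseline_violations, proco_violations)

    # 95% confidence interval on the mean per-seed reduction.
    diff = baseline_violations - proco_violations
    ci = stats.t.interval(0.95, len(diff) - 1, loc=diff.mean(), scale=stats.sem(diff))

    print(f"mean reduction={diff.mean():.1f}, t={t_stat:.2f}, p={p_value:.4f}, 95% CI={ci}")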

Circularity Check

0 steps flagged

No circularity: derivation uses external LLM grounding and standard model-based synthesis independent of fitted inputs

full rationale

The paper's claimed chain proceeds as: (1) fit dynamics model to offline safe trajectories, (2) construct conservative cost via LLM grounding of natural-language unsafe-state descriptions (external to data), (3) roll out the model from safe states toward LLM-identified high-risk regions to synthesize counterfactual unsafe samples, (4) use those samples for feasibility identification and policy improvement. None of these steps reduces by construction to its own inputs; the cost function is not fitted to the target violations or policy performance, the rollouts are genuine forward simulation (even if extrapolation quality is debatable on correctness grounds), and no self-citation or uniqueness theorem is invoked to force the architecture. The method therefore remains self-contained against the Safety-Gymnasium benchmarks and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The method introduces assumptions about model accuracy and the applicability of LLM knowledge; no free parameters are explicitly mentioned, though some are likely present in implementation details not visible in the abstract.

axioms (2)
  • domain assumption The dynamics model learned from offline data accurately predicts future states for rollouts.
    Central to generating counterfactual samples.
  • ad hoc to paper LLMs can provide reliable natural language knowledge about unsafe states in the specific domain.
    Used to construct conservative cost function without data.
invented entities (1)
  • Conservative cost function grounded in LLMs no independent evidence
    purpose: To estimate risks in safe-but-infeasible states
    New construct relying on LLM integration.

pith-pipeline@v0.9.0 · 5566 in / 1432 out tokens · 42420 ms · 2026-05-09T14:22:19.310907+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    Is conditional generative modeling all you need for decision making? InThe Eleventh International Conference on Learning Representations, 2023

    Anurag Ajay, Yilun Du, Abhi Gupta, Joshua B Tenenbaum, Tommi S Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision making? InThe Eleventh International Conference on Learning Representations, 2023

  2. [2]

    A survey of llm-based methods for synthetic data generation and the rise of agentic workflows

    Ahmad Alismail and Carsten Lanquillon. A survey of llm-based methods for synthetic data generation and the rise of agentic workflows. InInternational Conference on Human-Computer Interaction, pages 119–135. Springer, 2025

  3. [3]

    Constrained Markov decision processes

    Eitan Altman. Constrained Markov decision processes. Routledge, 2021

  4. [4]

    Hamilton-jacobi reachability: A brief overview and recent advances

    Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J Tomlin. Hamilton-jacobi reachability: A brief overview and recent advances. In2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 2242–2253, 2017

  5. [5]

    Safe learning in robotics: From learning-based control to safe reinforcement learning.Annual Review of Control, Robotics, and Autonomous Systems, 5(1):411–444, 2022

    Lukas Brunke, Melissa Greeff, Adam W Hall, Zhaocong Yuan, Siqi Zhou, Jacopo Panerati, and Angela P Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning.Annual Review of Control, Robotics, and Autonomous Systems, 5(1):411–444, 2022

  6. [6]

    Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods.IEEE Transactions on Neural Networks and Learning Systems, 2024

    Yuji Cao, Huan Zhao, Yuheng Cheng, Ting Shu, Yue Chen, Guolong Liu, Gaoqi Liang, Junhua Zhao, Jinyue Yan, and Yun Li. Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods.IEEE Transactions on Neural Networks and Learning Systems, 2024

  7. [7]

    Dime: Diffusion-based maximum entropy reinforcement learning

    Onur Celik, Zechu Li, Denis Blessing, Ge Li, Daniel Palenicek, Jan Peters, Georgia Chalvatzaki, and Gerhard Neumann. Dime: Diffusion-based maximum entropy reinforcement learning. In Forty-second International Conference on Machine Learning, 2025

  8. [8]

    Constraint-adaptive policy switching for offline safe reinforcement learning

    Yassine Chemingui, Aryan Deshwal, Honghao Wei, Alan Fern, and Jana Doppa. Constraint-adaptive policy switching for offline safe reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 15722–15730, 2025

  9. [9]

    Decision transformer: reinforcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: reinforcement learning via sequence modeling. InProceedings of the 35th International Conference on Neural Information Processing Systems, pages 15084–15097, 2021

  10. [10]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems, 2023

  11. [11]

    Safe rlhf: Safe reinforcement learning from human feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024

  12. [12]

    Diffusion-based reinforcement learning via q-weighted variational policy optimization

    Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via q-weighted variational policy optimization. InProceedings of the 38th International Conference on Neural Information Processing Systems, pages 53945–53968, 2024

  13. [13]

    Consistency models as a rich and efficient policy class for reinforcement learning

    Zihan Ding and Chi Jin. Consistency models as a rich and efficient policy class for reinforcement learning. InThe Twelfth International Conference on Learning Representations, 2024

  14. [14]

    Challenges of real-world reinforcement learning

    Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019

  15. [15]

    Scaling offline rl via efficient and expressive shortcut models.arXiv preprint arXiv:2505.22866, 2025

    Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kiante Brantley, and Wen Sun. Scaling offline rl via efficient and expressive shortcut models. arXiv preprint arXiv:2505.22866, 2025

  16. [16]

    Bridging hamilton-jacobi safety analysis and reinforcement learning

    Jaime F Fisac, Neil F Lugovoy, Vicenç Rubies-Royo, Shromona Ghosh, and Claire J Tomlin. Bridging hamilton-jacobi safety analysis and reinforcement learning. In2019 International Conference on Robotics and Automation, pages 8550–8556, 2019

  17. [17]

    Off-policy deep reinforcement learning without exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. InInternational Conference on Machine Learning, pages 2052–2062, 2019

  18. [18]

    Iterative reachability estimation for safe reinforcement learning

    Milan Ganai, Zheng Gong, Chenning Yu, Sylvia Herbert, and Sicun Gao. Iterative reachability estimation for safe reinforcement learning. InProceedings of the 37th International Conference on Neural Information Processing Systems, pages 69764–69797, 2023

  19. [19]

    A comprehensive survey on safe reinforcement learning

    Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015

  20. [20]

    Worldgpt: Empowering llm as multimodal world model

    Zhiqi Ge, Hongzhe Huang, Mingze Zhou, Juncheng Li, Guoming Wang, Siliang Tang, and Yueting Zhuang. Worldgpt: Empowering llm as multimodal world model. InProceedings of the 32nd ACM International Conference on Multimedia, pages 7346–7355, 2024

  21. [21]

    A review of safe reinforcement learning: Methods, theories and applications.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Shangding Gu, Long Yang, Yali Du, Guang Chen, Florian Walter, Jun Wang, and Alois Knoll. A review of safe reinforcement learning: Methods, theories and applications.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  22. [22]

    Constraint-conditioned actor-critic for offline safe reinforcement learning

    Zijian Guo, Weichao Zhou, Shengao Wang, and Wenchao Li. Constraint-conditioned actor-critic for offline safe reinforcement learning. InThe Thirteenth International Conference on Learning Representations, 2025

  23. [23]

    When to trust your model: model-based policy optimization

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: model-based policy optimization. InProceedings of the 33rd International Conference on Neural Information Processing Systems, pages 12519–12530, 2019

  24. [24]

    Planning with diffusion for flexible behavior synthesis

    Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. InInternational Conference on Machine Learning, pages 9902–9915, 2022

  25. [25]

    Safety-gymnasium: a unified safe reinforcement learning benchmark

    Jiaming Ji, Borong Zhang, Jiayi Zhou, Xuehai Pan, Weidong Huang, Ruiyang Sun, Yiran Geng, Yifan Zhong, Juntao Dai, and Yaodong Yang. Safety-gymnasium: a unified safe reinforcement learning benchmark. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 18964–18993, 2023

  26. [26]

    Smart-llm: Smart multi-agent robot task planning using large language models

    Shyam Sundar Kannan, Vishnunandan LN Venkatesh, and Byung-Cheol Min. Smart-llm: Smart multi-agent robot task planning using large language models. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 12140–12147. IEEE, 2024

  27. [27]

    Morel: model-based offline reinforcement learning

    Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: model-based offline reinforcement learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 21810–21823, 2020

  28. [28]

    Latent safety-constrained policy approach for safe offline reinforcement learning

    Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar, and Cody Fleming. Latent safety-constrained policy approach for safe offline reinforcement learning. In The Thirteenth International Conference on Learning Representations, 2025

  29. [29]

    Offline reinforcement learning with implicit q-learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. InInternational Conference on Learning Representations, 2022

  30. [30]

    Conservative q-learning for offline reinforcement learning

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. InProceedings of the 34th International Conference on Neural Information Processing Systems, pages 1179–1191, 2020

  31. [31]

    Coptidice: Offline constrained reinforcement learning via stationary distribution correction estimation

    Jongmin Lee, Cosmin Paduraru, Daniel J Mankowitz, Nicolas Heess, Doina Precup, Kee-Eung Kim, and Arthur Guez. Coptidice: Offline constrained reinforcement learning via stationary distribution correction estimation. InInternational Conference on Learning Representations, 2022

  32. [32]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020

  33. [33]

    Llm-assisted semantically diverse teammate generation for efficient multi-agent coordination

    Lihe Li, Lei Yuan, Pengsen Liu, Tao Jiang, and Yang Yu. Llm-assisted semantically diverse teammate generation for efficient multi-agent coordination. InForty-second International Conference on Machine Learning, 2025

  34. [34]

    Generative models in decision making: A survey

    Yinchuan Li, Xinyu Shao, Jianping Zhang, Haozhi Wang, Leo Maxime Brunswic, Kaiwen Zhou, Jiqian Dong, Kaiyang Guo, Xiu Li, Zhitang Chen, et al. Generative models in decision making: A survey. arXiv preprint arXiv:2502.17100, 2025

  35. [35]

    Constrained decision transformer for offline safe reinforcement learning

    Zuxin Liu, Zijian Guo, Yihang Yao, Zhepeng Cen, Wenhao Yu, Tingnan Zhang, and Ding Zhao. Constrained decision transformer for offline safe reinforcement learning. InInternational conference on machine learning, pages 21611–21630, 2023

  36. [36]

    Datasets and benchmarks for offline safe reinforcement learning.Journal of Data-centric Machine Learning Research, 2024

    Zuxin Liu, Zijian Guo, Haohong Lin, Yihang Yao, Jiacheng Zhu, Zhepeng Cen, Hanjiang Hu, Wenhao Yu, Tingnan Zhang, Jie Tan, and Ding Zhao. Datasets and benchmarks for offline safe reinforcement learning.Journal of Data-centric Machine Learning Research, 2024

  37. [37]

    Flow-based policy for online reinforcement learning.arXiv preprint arXiv:2506.12811, 2025

    Lei Lv, Yunfei Li, Yu Luo, Fuchun Sun, Tao Kong, Jiafeng Xu, and Xiao Ma. Flow-based policy for online reinforcement learning.arXiv preprint arXiv:2506.12811, 2025

  38. [38]

    Efficient online reinforcement learning for diffusion policy.arXiv preprint arXiv:2502.00361, 2025

    Haitong Ma, Tianyi Chen, Kai Wang, Na Li, and Bo Dai. Efficient online reinforcement learning for diffusion policy.arXiv preprint arXiv:2502.00361, 2025

  39. [39]

    Eureka: Human-level reward design via coding large language models

    Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. InThe Twelfth International Conference on Learning Representations, 2024

  40. [40]

    Data and domain knowledge dual-driven artificial intelligence: Survey, applications, and challenges.Expert Systems, 42(1):e13425, 2025

    Jing Nie, Jiachen Jiang, Yang Li, Huting Wang, Sezai Ercisli, and Linze Lv. Data and domain knowledge dual-driven artificial intelligence: Survey, applications, and challenges.Expert Systems, 42(1):e13425, 2025

  41. [41]

    History compression via language models in reinforcement learning

    Fabian Paischer, Thomas Adler, Vihang Patil, Angela Bitto-Nemling, Markus Holzleitner, Sebastian Lehner, Hamid Eghbal-Zadeh, and Sepp Hochreiter. History compression via language models in reinforcement learning. InInternational Conference on Machine Learning, pages 17156–17185, 2022

  42. [42]

    Flow q-learning

    Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning. arXiv preprint arXiv:2502.02538, 2025

  43. [43]

    Adapt: As-needed decomposition and planning with language models

    Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. Adapt: As-needed decomposition and planning with language models. arXiv preprint arXiv:2311.05772, 2023

  44. [44]

    A survey on offline reinforcement learning: Taxonomy, review, and open problems.IEEE Transactions on Neural Networks and Learning Systems, 2023

    Rafael Figueiredo Prudencio, Marcos ROA Maximo, and Esther Luna Colombini. A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Transactions on Neural Networks and Learning Systems, 2023

  45. [45]

    Reflexion: language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. InProceedings of the 37th International Conference on Neural Information Processing Systems, pages 8634–8652, 2023

  46. [46]

    Self-refined large language model as automated reward function designer for deep reinforcement learning in robotics.arXiv preprint arXiv:2309.06687, 2023

    Jiayang Song, Zhehua Zhou, Jiawei Liu, Chunrong Fang, Zhan Shu, and Lei Ma. Self-refined large language model as automated reward function designer for deep reinforcement learning in robotics.arXiv preprint arXiv:2309.06687, 2023

  47. [47]

    Responsive safety in reinforcement learning by pid lagrangian methods

    Adam Stooke, Joshua Achiam, and Pieter Abbeel. Responsive safety in reinforcement learning by pid lagrangian methods. InInternational Conference on Machine Learning, pages 9133– 9143, 2020

  48. [48]

    Model-bellman inconsistency for model-based offline reinforcement learning

    Yihao Sun, Jiaji Zhang, Chengxing Jia, Haoxin Lin, Junyin Ye, and Yang Yu. Model-bellman inconsistency for model-based offline reinforcement learning. InInternational Conference on Machine Learning, pages 33177–33194, 2023

  49. [49]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033, 2012

  50. [50]

    Intrinsic language-guided exploration for complex long-horizon robotic manipulation tasks

    Eleftherios Triantafyllidis, Filippos Christianos, and Zhibin Li. Intrinsic language-guided exploration for complex long-horizon robotic manipulation tasks. In2024 IEEE International Conference on Robotics and Automation, pages 7493–7500, 2024

  51. [51]

    Diffusion policies as an expressive policy class for offline reinforcement learning

    Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. InThe Eleventh International Conference on Learning Representations, 2023

  52. [52]

    Text2reward: Reward shaping with language models for reinforcement learning

    Tianbao Xie, Siheng Zhao, Chen Henry Wu, Yitao Liu, Qian Luo, Victor Zhong, Yanchao Yang, and Tao Yu. Text2reward: Reward shaping with language models for reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024

  53. [53]

    Constraints penalized q-learning for safe offline reinforcement learning

    Haoran Xu, Xianyuan Zhan, and Xiangyu Zhu. Constraints penalized q-learning for safe offline reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8753–8760, 2022

  54. [54]

    Adaptable safe policy learning from multi-task data with constraint prioritized decision transformer

    Ruiqi Xue, Ziqian Zhang, Lihe Li, Cong Guan, Lei Yuan, and Yang Yu. Adaptable safe policy learning from multi-task data with constraint prioritized decision transformer. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  55. [55]

    Efficient reinforcement learning with large language model priors

    Xue Yan, Yan Song, Xidong Feng, Mengyue Yang, Haifeng Zhang, Haitham Bou Ammar, and Jun Wang. Efficient reinforcement learning with large language model priors. InThe Thirteenth International Conference on Learning Representations, 2025

  56. [56]

    Believe what you see: implicit constraint approach for offline multi-agent reinforcement learning

    Yiqin Yang, Xiaoteng Ma, Chenghao Li, Zewu Zheng, Qiyuan Zhang, Gao Huang, Jun Yang, and Qianchuan Zhao. Believe what you see: implicit constraint approach for offline multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pages 10299–10312, 2021

  57. [57]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023

  58. [58]

    Reachability constrained reinforcement learning

    Dongjie Yu, Haitong Ma, Shengbo Li, and Jianyu Chen. Reachability constrained reinforcement learning. InInternational conference on machine learning, pages 25636–25655, 2022

  59. [59]

    Mopo: model-based offline policy optimization

    Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: model-based offline policy optimization. InProceedings of the 34th International Conference on Neural Information Processing Systems, pages 14129–14142, 2020

  60. [60]

    Language to rewards for robotic skill synthesis

    Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montserrat Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, et al. Language to rewards for robotic skill synthesis. In 7th Annual Conference on Robot Learning, 2023

  61. [61]

    Safe reinforcement learning with stability guarantee for motion planning of autonomous vehicles

    Lixian Zhang, Ruixian Zhang, Tong Wu, Rui Weng, Minghao Han, and Ye Zhao. Safe reinforcement learning with stability guarantee for motion planning of autonomous vehicles. IEEE transactions on neural networks and learning systems, 32(12):5435–5444, 2021

  62. [62]

    Safe offline reinforcement learning with feasibility-guided diffusion model

    Yinan Zheng, Jianxiong Li, Dongjie Yu, Yujie Yang, Shengbo Eben Li, Xianyuan Zhan, and Jingjing Liu. Safe offline reinforcement learning with feasibility-guided diffusion model. In The Twelfth International Conference on Learning Representations, 2024

  63. [63]

    C2iql: Constraint-conditioned implicit q-learning for safe offline reinforcement learning

    LIU Zifan, Xinran Li, and Jun Zhang. C2iql: Constraint-conditioned implicit q-learning for safe offline reinforcement learning. In Forty-second International Conference on Machine Learning, 2025

  64. [64]

    Any hazard reading r >= 0.90 → unsafe

  65. [65]

    Any pillar reading r >= 0.85 → unsafe. Fragment of cost-function code extracted with the paper: the observation is validated as a 1D array with length >= 48, the 16-dim hazard readings are taken as obs[-48:-32] and the pillar readings as obs[-32:-16], and the function returns 0 if the state is safe and 1 if it is unsafe (HAZARD_THRESH = 0.90).