pith. machine review for the scientific record.

arxiv: 2605.01356 · v1 · submitted 2026-05-02 · 💻 cs.LG · cs.AI

Recognition: unknown

Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data

Authors on Pith no claims yet

Pith reviewed 2026-05-09 14:22 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords safe reinforcement learning · offline RL · large language models · model-based planning · constraint violations · counterfactual data generation · feasibility identification · Safety-Gymnasium

The pith

By using large language models to define costs for unsafe states and a learned dynamics model to simulate future violations, PROCO enables learning of safe policies from offline data that has few or no examples of constraint violations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that standard offline safe reinforcement learning breaks down on datasets with almost no unsafe examples because it cannot detect states that satisfy constraints now but will violate them soon. PROCO fixes this by grounding natural-language knowledge of risks into a conservative cost function via large language models, then using a learned dynamics model to roll out and synthesize many counterfactual unsafe trajectories. These synthetic samples let the method identify infeasible states and train policies that avoid them. The result matters for high-stakes settings where collecting risky data is impossible, allowing safer deployment without online trial-and-error.
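
To make the grounding step concrete: below is a minimal sketch of the kind of rule-based unsafe-state check an LLM could emit from natural-language safety descriptions, reusing the hazard (>= 0.90) and pillar (>= 0.85) thresholds that appear in the code fragment extracted with the paper (entries 64-65 of the reference graph). The function name, observation layout, and example observation are illustrative assumptions, not the paper's exact implementation.

    import numpy as np

    # Thresholds follow the code fragment extracted with the paper (reference-graph
    # entries 64-65); the observation layout is an assumption for illustration.
    HAZARD_THRESH = 0.90
    PILLAR_THRESH = 0.85

    def conservative_cost(observation) -> int:
        """Return 1 if the state is treated as unsafe, 0 otherwise."""
        obs = np.asarray(observation, dtype=float)
        if obs.ndim != 1 or obs.size < 48:
            raise ValueError("Observation must be a 1D array with length >= 48.")
        hazard_readings = obs[-48:-32]   # 16-dim hazard readings (assumed layout)
        pillar_readings = obs[-32:-16]   # 16-dim pillar readings (assumed layout)
        unsafe = (hazard_readings >= HAZARD_THRESH).any() or (pillar_readings >= PILLAR_THRESH).any()
        return int(unsafe)

    # Example: a synthetic observation in which one hazard reading crosses the threshold.
    obs = np.zeros(60)
    obs[-40] = 0.95                      # index -40 lies inside the hazard slice obs[-48:-32]
    print(conservative_cost(obs))        # -> 1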

Core claim

PROCO learns a dynamics model from the given offline data, builds a conservative cost function by grounding natural-language descriptions of unsafe states through large language models, and then runs model-based rollouts to create diverse counterfactual unsafe samples. These samples support reliable detection of safe-but-infeasible states and guide feasibility-aware policy optimization, yielding policies with fewer constraint violations than both the original offline safe RL methods and behavior-cloning baselines.

What carries the argument

The proactive cost generation process: an LLM-grounded conservative cost function combined with learned dynamics model rollouts that synthesize counterfactual unsafe samples for feasibility identification.
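
A minimal sketch of this synthesis loop, under stated assumptions: the dynamics model below is a toy damped linear map standing in for the learned model, the cost rule is a placeholder threshold rather than an LLM-grounded one, and the random-action rollout scheme is illustrative, not the paper's procedure.

    import numpy as np

    rng = np.random.default_rng(0)

    def dynamics_model(state, action):
        # Toy stand-in for the learned dynamics model: damped drift plus an action effect.
        return 0.9 * state + 0.3 * action

    def conservative_cost(state) -> int:
        # Placeholder for the LLM-grounded rule: any coordinate beyond 1.0 counts as unsafe.
        return int(np.abs(np.asarray(state)).max() >= 1.0)

    def synthesize_unsafe_rollouts(start_states, horizon=10, rollouts_per_state=8):
        """Roll the model forward from safe offline states and keep trajectories that
        the conservative cost flags, yielding counterfactual unsafe samples."""
        unsafe_trajectories = []
        for s0 in start_states:
            for _ in range(rollouts_per_state):
                state = np.array(s0, dtype=float)
                traj = [state.copy()]
                for _ in range(horizon):
                    action = rng.normal(size=state.shape)    # exploratory rollout actions
                    state = dynamics_model(state, action)
                    traj.append(state.copy())
                if any(conservative_cost(s) for s in traj):  # rollout reaches an unsafe state
                    unsafe_trajectories.append(np.stack(traj))
        return unsafe_trajectories

    safe_starts = [np.zeros(2), np.array([0.3, -0.2])]       # states from the (safe) offline data
    synthetic = synthesize_unsafe_rollouts(safe_starts)
    print(f"{len(synthetic)} counterfactual unsafe rollouts synthesized")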

If this is right

  • Existing offline safe RL algorithms integrate directly with PROCO and show lower violation rates on the same limited-violation datasets.
  • Policies trained this way avoid states that satisfy constraints at the current step but lead to violations within a few steps.
  • The method outperforms pure behavior cloning baselines in both safety and task performance across tested environments.
  • It supports learning from datasets that contain exclusively safe trajectories without requiring any observed violations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same grounding-plus-rollout pattern could apply in other domains where textual safety rules exist but violation examples are absent from data.
  • Improving the accuracy of the dynamics model or the LLM grounding step would likely amplify the safety gains without changing the core structure.
  • This suggests hybrid systems that mix learned models with external knowledge sources may reduce reliance on dangerous data collection in real-world robotics and control.

Load-bearing premise

The dynamics model learned from offline data must be accurate enough to produce reliable multi-step counterfactual trajectories that actually predict future violations.
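
One way to probe this premise is to roll out an ensemble of dynamics models and distrust rollouts on which the members disagree, in the spirit of the ensemble-based uncertainty estimates discussed in the editorial exchange below. In this sketch the ensemble members are toy perturbed linear maps and the disagreement cutoff is an assumed value.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy ensemble: each member is a slightly perturbed linear map standing in for an
    # independently trained dynamics network. The default argument freezes each member's
    # coefficient at creation time.
    ENSEMBLE = [lambda s, a, A=(0.9 + rng.normal(scale=0.02)): A * s + 0.3 * a
                for _ in range(5)]

    DISAGREEMENT_THRESH = 0.05   # assumed cutoff; rollouts above it would not be trusted

    def rollout_disagreement(start_state, actions):
        """Largest per-step standard deviation across ensemble predictions along a rollout."""
        states = [np.array(start_state, dtype=float) for _ in ENSEMBLE]
        worst = 0.0
        for a in actions:
            states = [model(s, a) for model, s in zip(ENSEMBLE, states)]
            worst = max(worst, float(np.std(np.stack(states), axis=0).max()))
        return worst

    actions = rng.normal(size=(10, 2))                 # a 10-step action sequence in a 2-D toy space
    score = rollout_disagreement(np.zeros(2), actions)
    print(f"disagreement={score:.3f}, trusted={score < DISAGREEMENT_THRESH}")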

What would settle it

On a Safety-Gymnasium task with clean data, replace the learned dynamics model with the true simulator: if PROCO then shows no reduction in constraint violations relative to the baseline method, the cost generation step does not help.

Figures

Figures reproduced from arXiv: 2605.01356 by Jing-Wen Yang, Kainuo Cheng, Lei Yuan, Ruiqi Xue, Yang Yu.

Figure 1
Figure 1. Figure 1: Structure of PROCO. (Adjacent definitions from the paper: V^π_h(s) ≤ 0 ⇒ h(s_t) ≤ 0 for all t, i.e., π satisfies the hard constraint starting from s; V^*_h(s) ≤ 0 ⇒ ∃π with V^π_h(s) ≤ 0. The feasible set and largest feasible set are S^π_f := {s | V^π_h(s) ≤ 0} and S^*_f := {s | V^*_h(s) ≤ 0}.) view at source ↗
Figure 2
Figure 2. Figure 2: (a) Visualization of the Circle task. (b) and (c) Visualization of FISOR and PROCO view at source ↗
Figure 3
Figure 3. Figure 3: (a) Analysis on rollout epoch E. (b), (c) Analysis on rollout length H. (d) Analysis on LLM selection. (Adjacent text from the paper: existing SOTA offline safe RL algorithms, namely LSPC, CAPS, and FISOR, perform well with many unsafe samples but fail on safe-only datasets.) view at source ↗
Figure 4
Figure 4. Figure 4: (a) Ablation study results. (b) Feasibility view at source ↗
Figure 5
Figure 5. Figure 5: Performance analysis under different unsafe data percent. view at source ↗
Figure 6
Figure 6. Figure 6: Tasks used in this paper. (a) Navigation Tasks based on Point and Car robots. (b) Velocity view at source ↗
read the original abstract

Learning constraint-satisfying policies from offline data without risky online interaction is crucial for safety-critical decision making. Conventional methods typically learn cost value functions from abundant unsafe samples to define safety boundaries and penalize violations. However, in high-stakes scenarios, risky trial-and-error is infeasible, yielding datasets with few or no unsafe samples. Under this limitation, existing approaches often treat all data as uniformly safe, overlooking safe-but-infeasible states - states that currently satisfy constraints but inevitably violate them within a few steps - leading to deployment failures. Drawing inspiration from the concept of knowledge-data integration, we leverage large language models (LLMs) to incorporate natural language knowledge into the policy to address this challenge. Specifically, we propose PROCO, a model-based offline safe reinforcement learning (RL) framework tailored to datasets largely free of violations. PROCO first learns a dynamics model from offline data and constructs a conservative cost function by grounding natural-language knowledge of unsafe states in LLMs, enabling risk estimation even without observed violations. Using the cost function and learned model, PROCO performs model-based rollouts to synthesize diverse counterfactual unsafe samples, supporting reliable feasibility identification and feasibility-guided policy learning. Across a range of Safety-Gymnasium tasks with exclusively safe or minimally risky training data, PROCO integrates seamlessly with a variety of offline safe RL algorithms and consistently demonstrates reduced constraint violations and improved safety performance compared to both the original methods and other behavior cloning baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PROCO, a model-based offline safe RL method for learning constraint-satisfying policies from datasets containing exclusively safe or minimally risky trajectories. It learns a dynamics model from the offline data, grounds a conservative cost function in natural-language knowledge from LLMs to estimate risks without observed violations, performs model-based rollouts to synthesize counterfactual unsafe samples, and uses these to enable reliable feasibility identification and feasibility-guided policy learning. The approach integrates with existing offline safe RL algorithms and is evaluated on Safety-Gymnasium tasks, claiming reduced constraint violations and improved safety performance over baselines.

Significance. If the empirical claims hold under rigorous validation of the counterfactual samples, the work would address a key practical gap in offline safe RL: handling high-stakes settings where unsafe data cannot be collected. It combines LLM knowledge with model-based synthesis in a way that could extend to other domains with sparse violation data, provided the extrapolation issues are resolved.

major comments (2)
  1. [Method] Method section (dynamics model and rollout procedure): The central claim depends on the learned dynamics model producing reliable counterfactual unsafe trajectories for feasibility identification. Because the model is trained exclusively on safe or minimally risky data, rollouts toward LLM-identified high-risk states necessarily involve extrapolation. The manuscript provides no error analysis, uncertainty quantification, or ablation on rollout accuracy (e.g., comparison of predicted vs. true violations in held-out unsafe states), which is load-bearing for the reported reductions in constraint violations.
  2. [Experiments] Experiments section (Safety-Gymnasium results): The abstract and claims assert consistent improvements across tasks with 'exclusively safe or minimally risky training data,' yet no details are given on how the offline datasets were constructed to ensure zero or near-zero violations, nor on statistical significance of the violation reductions versus baselines. This makes it impossible to assess whether the gains stem from the proposed LLM-grounded costs and counterfactuals or from other factors.
minor comments (2)
  1. [Abstract] The abstract states empirical improvements but omits any mention of baselines, number of seeds, or statistical tests; these details should be summarized even in the abstract for clarity.
  2. [Method] Notation for the conservative cost function and feasibility identification should be introduced with explicit equations rather than prose descriptions to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and commit to revisions that strengthen the validation of the dynamics model and experimental details without altering the core claims.

read point-by-point responses
  1. Referee: [Method] Method section (dynamics model and rollout procedure): The central claim depends on the learned dynamics model producing reliable counterfactual unsafe trajectories for feasibility identification. Because the model is trained exclusively on safe or minimally risky data, rollouts toward LLM-identified high-risk states necessarily involve extrapolation. The manuscript provides no error analysis, uncertainty quantification, or ablation on rollout accuracy (e.g., comparison of predicted vs. true violations in held-out unsafe states), which is load-bearing for the reported reductions in constraint violations.

    Authors: We agree that the dynamics model, trained only on safe data, requires explicit validation when performing extrapolative rollouts to LLM-identified risky states. The conservative LLM-grounded cost function is intended to mitigate over-optimism by providing an independent risk signal, but this does not replace the need for rollout accuracy checks. In the revised manuscript we will add an error analysis subsection that reports prediction errors on simulated unsafe trajectories (generated via controlled perturbations of safe data), ensemble-based uncertainty estimates for the dynamics model, and an ablation comparing feasibility identification with and without rollout uncertainty thresholding. revision: yes

  2. Referee: [Experiments] Experiments section (Safety-Gymnasium results): The abstract and claims assert consistent improvements across tasks with 'exclusively safe or minimally risky training data,' yet no details are given on how the offline datasets were constructed to ensure zero or near-zero violations, nor on statistical significance of the violation reductions versus baselines. This makes it impossible to assess whether the gains stem from the proposed LLM-grounded costs and counterfactuals or from other factors.

    Authors: The referee correctly notes that additional transparency is required. The datasets were generated by rolling out near-optimal safe policies (with violation rates below 1% per episode) in Safety-Gymnasium environments; we will expand the experimental setup section with the precise data-collection procedure, per-task violation counts in the offline data, and the exact number of trajectories. We will also report statistical significance (paired t-tests with p-values and confidence intervals) for all violation-reduction results versus baselines to confirm that observed gains are attributable to the LLM-grounded costs and counterfactual synthesis rather than implementation differences. revision: yes
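
As a point of reference for the committed reporting, a minimal sketch of a paired t-test with a confidence interval over per-seed violation counts; the arrays below are placeholder numbers, not results from the paper.

    import numpy as np
    from scipy import stats

    # Placeholder per-seed constraint-violation counts (illustrative numbers, not paper results).
    baseline_violations = np.array([14.0, 11.0, 17.0, 13.0, 15.0])
    proco_violations = np.array([6.0, 4.0, 9.0, 5.0, 7.0])

    # Paired t-test across seeds.
    t_stat, p_value = stats.ttest_rel(baseline_violations, proco_violations)

    # 95% confidence interval on the mean per-seed reduction.
    diff = baseline_violations - proco_violations
    ci = stats.t.interval(0.95, len(diff) - 1, loc=diff.mean(), scale=stats.sem(diff))

    print(f"mean reduction={diff.mean():.1f}, t={t_stat:.2f}, p={p_value:.4f}, 95% CI={ci}")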

Circularity Check

0 steps flagged

No circularity: derivation uses external LLM grounding and standard model-based synthesis independent of fitted inputs

full rationale

The paper's claimed chain proceeds as: (1) fit dynamics model to offline safe trajectories, (2) construct conservative cost via LLM grounding of natural-language unsafe-state descriptions (external to data), (3) roll out the model from safe states toward LLM-identified high-risk regions to synthesize counterfactual unsafe samples, (4) use those samples for feasibility identification and policy improvement. None of these steps reduces by construction to its own inputs; the cost function is not fitted to the target violations or policy performance, the rollouts are genuine forward simulation (even if extrapolation quality is debatable on correctness grounds), and no self-citation or uniqueness theorem is invoked to force the architecture. The method therefore remains self-contained against the Safety-Gymnasium benchmarks and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The method introduces assumptions about model accuracy and the applicability of LLM knowledge; no free parameters are explicitly mentioned, though some are likely present in implementation details not visible in the abstract.

axioms (2)
  • domain assumption The dynamics model learned from offline data accurately predicts future states for rollouts.
    Central to generating counterfactual samples.
  • ad hoc to paper LLMs can provide reliable natural language knowledge about unsafe states in the specific domain.
    Used to construct conservative cost function without data.
invented entities (1)
  • Conservative cost function grounded in LLMs no independent evidence
    purpose: To estimate risks in safe-but-infeasible states
    New construct relying on LLM integration.

pith-pipeline@v0.9.0 · 5566 in / 1432 out tokens · 42420 ms · 2026-05-09T14:22:19.310907+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    Is conditional generative modeling all you need for decision making? InThe Eleventh International Conference on Learning Representations, 2023

    Anurag Ajay, Yilun Du, Abhi Gupta, Joshua B Tenenbaum, Tommi S Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision making? InThe Eleventh International Conference on Learning Representations, 2023

  2. [2]

    A survey of llm-based methods for synthetic data generation and the rise of agentic workflows

    Ahmad Alismail and Carsten Lanquillon. A survey of llm-based methods for synthetic data generation and the rise of agentic workflows. InInternational Conference on Human-Computer Interaction, pages 119–135. Springer, 2025

  3. [3]

    Constrained Markov decision processes

    Eitan Altman. Constrained Markov decision processes. Routledge, 2021

  4. [4]

    Hamilton-jacobi reachability: A brief overview and recent advances

    Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J Tomlin. Hamilton-jacobi reachability: A brief overview and recent advances. In2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 2242–2253, 2017

  5. [5]

    Safe learning in robotics: From learning-based control to safe reinforcement learning.Annual Review of Control, Robotics, and Autonomous Systems, 5(1):411–444, 2022

    Lukas Brunke, Melissa Greeff, Adam W Hall, Zhaocong Yuan, Siqi Zhou, Jacopo Panerati, and Angela P Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning.Annual Review of Control, Robotics, and Autonomous Systems, 5(1):411–444, 2022

  6. [6]

    Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods.IEEE Transactions on Neural Networks and Learning Systems, 2024

    Yuji Cao, Huan Zhao, Yuheng Cheng, Ting Shu, Yue Chen, Guolong Liu, Gaoqi Liang, Junhua Zhao, Jinyue Yan, and Yun Li. Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods.IEEE Transactions on Neural Networks and Learning Systems, 2024

  7. [7]

    Dime: Diffusion-based maximum entropy reinforcement learning

    Onur Celik, Zechu Li, Denis Blessing, Ge Li, Daniel Palenicek, Jan Peters, Georgia Chalvatzaki, and Gerhard Neumann. Dime: Diffusion-based maximum entropy reinforcement learning. In Forty-second International Conference on Machine Learning, 2025

  8. [8]

    Constraint-adaptive policy switching for offline safe reinforcement learning

    Yassine Chemingui, Aryan Deshwal, Honghao Wei, Alan Fern, and Jana Doppa. Constraint-adaptive policy switching for offline safe reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 15722–15730, 2025

  9. [9]

    Decision transformer: reinforcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: reinforcement learning via sequence modeling. InProceedings of the 35th International Conference on Neural Information Processing Systems, pages 15084–15097, 2021

  10. [10]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems, 2023

  11. [11]

    Safe rlhf: Safe reinforcement learning from human feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024

  12. [12]

    Diffusion-based reinforcement learning via q-weighted variational policy optimization

    Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via q-weighted variational policy optimization. InProceedings of the 38th International Conference on Neural Information Processing Systems, pages 53945–53968, 2024

  13. [13]

    Consistency models as a rich and efficient policy class for reinforcement learning

    Zihan Ding and Chi Jin. Consistency models as a rich and efficient policy class for reinforcement learning. InThe Twelfth International Conference on Learning Representations, 2024

  14. [14]

    Challenges of real-world reinforcement learning

    Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019

  15. [15]

    Scaling offline rl via efficient and expressive shortcut models.arXiv preprint arXiv:2505.22866, 2025

    Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kiante Brantley, and Wen Sun. Scaling offline rl via efficient and expressive shortcut models. arXiv preprint arXiv:2505.22866, 2025

  16. [16]

    Bridging hamilton-jacobi safety analysis and reinforcement learning

    Jaime F Fisac, Neil F Lugovoy, Vicenç Rubies-Royo, Shromona Ghosh, and Claire J Tomlin. Bridging hamilton-jacobi safety analysis and reinforcement learning. In2019 International Conference on Robotics and Automation, pages 8550–8556, 2019

  17. [17]

    Off-policy deep reinforcement learning without exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. InInternational Conference on Machine Learning, pages 2052–2062, 2019

  18. [18]

    Iterative reachability estimation for safe reinforcement learning

    Milan Ganai, Zheng Gong, Chenning Yu, Sylvia Herbert, and Sicun Gao. Iterative reachability estimation for safe reinforcement learning. InProceedings of the 37th International Conference on Neural Information Processing Systems, pages 69764–69797, 2023

  19. [19]

    A comprehensive survey on safe reinforcement learning

    Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015

  20. [20]

    Worldgpt: Empowering llm as multimodal world model

    Zhiqi Ge, Hongzhe Huang, Mingze Zhou, Juncheng Li, Guoming Wang, Siliang Tang, and Yueting Zhuang. Worldgpt: Empowering llm as multimodal world model. InProceedings of the 32nd ACM International Conference on Multimedia, pages 7346–7355, 2024

  21. [21]

    A review of safe reinforcement learning: Methods, theories and applications.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Shangding Gu, Long Yang, Yali Du, Guang Chen, Florian Walter, Jun Wang, and Alois Knoll. A review of safe reinforcement learning: Methods, theories and applications.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  22. [22]

    Constraint-conditioned actor-critic for offline safe reinforcement learning

    Zijian Guo, Weichao Zhou, Shengao Wang, and Wenchao Li. Constraint-conditioned actor-critic for offline safe reinforcement learning. InThe Thirteenth International Conference on Learning Representations, 2025

  23. [23]

    When to trust your model: model-based policy optimization

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: model-based policy optimization. InProceedings of the 33rd International Conference on Neural Information Processing Systems, pages 12519–12530, 2019

  24. [24]

    Planning with diffusion for flexible behavior synthesis

    Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. InInternational Conference on Machine Learning, pages 9902–9915, 2022

  25. [25]

    Safety-gymnasium: a unified safe reinforcement learning benchmark

    Jiaming Ji, Borong Zhang, Jiayi Zhou, Xuehai Pan, Weidong Huang, Ruiyang Sun, Yiran Geng, Yifan Zhong, Juntao Dai, and Yaodong Yang. Safety-gymnasium: a unified safe reinforcement learning benchmark. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 18964–18993, 2023

  26. [26]

    Smart-llm: Smart multi-agent robot task planning using large language models

    Shyam Sundar Kannan, Vishnunandan LN Venkatesh, and Byung-Cheol Min. Smart-llm: Smart multi-agent robot task planning using large language models. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 12140–12147. IEEE, 2024

  27. [27]

    Morel: model-based offline reinforcement learning

    Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: model-based offline reinforcement learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 21810–21823, 2020

  28. [28]

    Latent safety-constrained policy approach for safe offline reinforcement learning

    Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar, and Cody Fleming. Latent safety-constrained policy approach for safe offline reinforcement learning. In The Thirteenth International Conference on Learning Representations, 2025

  29. [29]

    Offline reinforcement learning with implicit q-learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. InInternational Conference on Learning Representations, 2022

  30. [30]

    Conservative q-learning for offline reinforcement learning

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. InProceedings of the 34th International Conference on Neural Information Processing Systems, pages 1179–1191, 2020

  31. [31]

    Coptidice: Offline constrained reinforcement learning via stationary distribution correction estimation

    Jongmin Lee, Cosmin Paduraru, Daniel J Mankowitz, Nicolas Heess, Doina Precup, Kee-Eung Kim, and Arthur Guez. Coptidice: Offline constrained reinforcement learning via stationary distribution correction estimation. InInternational Conference on Learning Representations, 2022

  32. [32]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020

  33. [33]

    Llm-assisted semantically diverse teammate generation for efficient multi-agent coordination

    Lihe Li, Lei Yuan, Pengsen Liu, Tao Jiang, and Yang Yu. Llm-assisted semantically diverse teammate generation for efficient multi-agent coordination. InForty-second International Conference on Machine Learning, 2025

  34. [34]

    Generative models in decision making: A survey

    Yinchuan Li, Xinyu Shao, Jianping Zhang, Haozhi Wang, Leo Maxime Brunswic, Kaiwen Zhou, Jiqian Dong, Kaiyang Guo, Xiu Li, Zhitang Chen, et al. Generative models in decision making: A survey. arXiv preprint arXiv:2502.17100, 2025

  35. [35]

    Constrained decision transformer for offline safe reinforcement learning

    Zuxin Liu, Zijian Guo, Yihang Yao, Zhepeng Cen, Wenhao Yu, Tingnan Zhang, and Ding Zhao. Constrained decision transformer for offline safe reinforcement learning. InInternational conference on machine learning, pages 21611–21630, 2023

  36. [36]

    Datasets and benchmarks for offline safe reinforcement learning.Journal of Data-centric Machine Learning Research, 2024

    Zuxin Liu, Zijian Guo, Haohong Lin, Yihang Yao, Jiacheng Zhu, Zhepeng Cen, Hanjiang Hu, Wenhao Yu, Tingnan Zhang, Jie Tan, and Ding Zhao. Datasets and benchmarks for offline safe reinforcement learning.Journal of Data-centric Machine Learning Research, 2024

  37. [37]

    Flow-based policy for online reinforcement learning.arXiv preprint arXiv:2506.12811, 2025

    Lei Lv, Yunfei Li, Yu Luo, Fuchun Sun, Tao Kong, Jiafeng Xu, and Xiao Ma. Flow-based policy for online reinforcement learning.arXiv preprint arXiv:2506.12811, 2025

  38. [38]

    Efficient online reinforcement learning for diffusion policy.arXiv preprint arXiv:2502.00361, 2025

    Haitong Ma, Tianyi Chen, Kai Wang, Na Li, and Bo Dai. Efficient online reinforcement learning for diffusion policy.arXiv preprint arXiv:2502.00361, 2025

  39. [39]

    Eureka: Human-level reward design via coding large language models

    Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. InThe Twelfth International Conference on Learning Representations, 2024

  40. [40]

    Data and domain knowledge dual-driven artificial intelligence: Survey, applications, and challenges.Expert Systems, 42(1):e13425, 2025

    Jing Nie, Jiachen Jiang, Yang Li, Huting Wang, Sezai Ercisli, and Linze Lv. Data and domain knowledge dual-driven artificial intelligence: Survey, applications, and challenges.Expert Systems, 42(1):e13425, 2025

  41. [41]

    History compression via language models in reinforcement learning

    Fabian Paischer, Thomas Adler, Vihang Patil, Angela Bitto-Nemling, Markus Holzleitner, Sebastian Lehner, Hamid Eghbal-Zadeh, and Sepp Hochreiter. History compression via language models in reinforcement learning. InInternational Conference on Machine Learning, pages 17156–17185, 2022

  42. [42]

    Flow q-learning

    Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning. arXiv preprint arXiv:2502.02538, 2025

  43. [43]

    Adapt: As-needed decomposition and planning with language models

    Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. Adapt: As-needed decomposition and planning with language models. arXiv preprint arXiv:2311.05772, 2023

  44. [44]

    A survey on offline reinforcement learning: Taxonomy, review, and open problems.IEEE Transactions on Neural Networks and Learning Systems, 2023

    Rafael Figueiredo Prudencio, Marcos ROA Maximo, and Esther Luna Colombini. A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Transactions on Neural Networks and Learning Systems, 2023

  45. [45]

    Reflexion: language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. InProceedings of the 37th International Conference on Neural Information Processing Systems, pages 8634–8652, 2023

  46. [46]

    Self-refined large language model as automated reward function designer for deep reinforcement learning in robotics.arXiv preprint arXiv:2309.06687, 2023

    Jiayang Song, Zhehua Zhou, Jiawei Liu, Chunrong Fang, Zhan Shu, and Lei Ma. Self-refined large language model as automated reward function designer for deep reinforcement learning in robotics.arXiv preprint arXiv:2309.06687, 2023

  47. [47]

    Responsive safety in reinforcement learning by pid lagrangian methods

    Adam Stooke, Joshua Achiam, and Pieter Abbeel. Responsive safety in reinforcement learning by pid lagrangian methods. InInternational Conference on Machine Learning, pages 9133– 9143, 2020

  48. [48]

    Model-bellman inconsistency for model-based offline reinforcement learning

    Yihao Sun, Jiaji Zhang, Chengxing Jia, Haoxin Lin, Junyin Ye, and Yang Yu. Model-bellman inconsistency for model-based offline reinforcement learning. InInternational Conference on Machine Learning, pages 33177–33194, 2023

  49. [49]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033, 2012

  50. [50]

    Intrinsic language-guided exploration for complex long-horizon robotic manipulation tasks

    Eleftherios Triantafyllidis, Filippos Christianos, and Zhibin Li. Intrinsic language-guided exploration for complex long-horizon robotic manipulation tasks. In2024 IEEE International Conference on Robotics and Automation, pages 7493–7500, 2024

  51. [51]

    Diffusion policies as an expressive policy class for offline reinforcement learning

    Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. InThe Eleventh International Conference on Learning Representations, 2023

  52. [52]

    Text2reward: Reward shaping with language models for reinforcement learning

    Tianbao Xie, Siheng Zhao, Chen Henry Wu, Yitao Liu, Qian Luo, Victor Zhong, Yanchao Yang, and Tao Yu. Text2reward: Reward shaping with language models for reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024

  53. [53]

    Constraints penalized q-learning for safe offline reinforcement learning

    Haoran Xu, Xianyuan Zhan, and Xiangyu Zhu. Constraints penalized q-learning for safe offline reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8753–8760, 2022

  54. [54]

    Adaptable safe policy learning from multi-task data with constraint prioritized decision transformer

    Ruiqi Xue, Ziqian Zhang, Lihe Li, Cong Guan, Lei Yuan, and Yang Yu. Adaptable safe policy learning from multi-task data with constraint prioritized decision transformer. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  55. [55]

    Efficient reinforcement learning with large language model priors

    Xue Yan, Yan Song, Xidong Feng, Mengyue Yang, Haifeng Zhang, Haitham Bou Ammar, and Jun Wang. Efficient reinforcement learning with large language model priors. InThe Thirteenth International Conference on Learning Representations, 2025

  56. [56]

    Believe what you see: implicit constraint approach for offline multi-agent reinforcement learning

    Yiqin Yang, Xiaoteng Ma, Chenghao Li, Zewu Zheng, Qiyuan Zhang, Gao Huang, Jun Yang, and Qianchuan Zhao. Believe what you see: implicit constraint approach for offline multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pages 10299–10312, 2021

  57. [57]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023

  58. [58]

    Reachability constrained reinforcement learning

    Dongjie Yu, Haitong Ma, Shengbo Li, and Jianyu Chen. Reachability constrained reinforcement learning. InInternational conference on machine learning, pages 25636–25655, 2022

  59. [59]

    Mopo: model-based offline policy optimization

    Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: model-based offline policy optimization. InProceedings of the 34th International Conference on Neural Information Processing Systems, pages 14129–14142, 2020

  60. [60]

    Language to rewards for robotic skill synthesis

    Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montserrat Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, et al. Language to rewards for robotic skill synthesis. In 7th Annual Conference on Robot Learning, 2023

  61. [61]

    Safe reinforcement learning with stability guarantee for motion planning of autonomous vehicles

    Lixian Zhang, Ruixian Zhang, Tong Wu, Rui Weng, Minghao Han, and Ye Zhao. Safe reinforcement learning with stability guarantee for motion planning of autonomous vehicles. IEEE transactions on neural networks and learning systems, 32(12):5435–5444, 2021

  62. [62]

    Safe offline reinforcement learning with feasibility-guided diffusion model

    Yinan Zheng, Jianxiong Li, Dongjie Yu, Yujie Yang, Shengbo Eben Li, Xianyuan Zhan, and Jingjing Liu. Safe offline reinforcement learning with feasibility-guided diffusion model. In The Twelfth International Conference on Learning Representations, 2024

  63. [63]

    C2iql: Constraint-conditioned implicit q-learning for safe offline reinforcement learning

    LIU Zifan, Xinran Li, and Jun Zhang. C2iql: Constraint-conditioned implicit q-learning for safe offline reinforcement learning. In Forty-second International Conference on Machine Learning, 2025

  64. [64]

    Any hazard reading r >= 0.90 → unsafe

  65. [65]

    Any pillar reading r >= 0.85 → unsafe. Fragment of cost-function code extracted with the paper: the observation is validated as a 1D array with length >= 48, the 16-dim hazard readings are taken as obs[-48:-32] and the pillar readings as obs[-32:-16], and the function returns 0 if the state is safe and 1 if it is unsafe (HAZARD_THRESH = 0.90).