Safe-RULE: Safe Reinforcement UnLEarning

Fanxin Kong; Shixiong Jiang; Taozheng Zhu

arxiv: 2606.09559 · v1 · pith:VZIMVLI2new · submitted 2026-06-08 · 💻 cs.LG · cs.AI· cs.CR· cs.RO

Safe-RULE: Safe Reinforcement UnLEarning

Shixiong Jiang , Taozheng Zhu , Fanxin Kong This is my paper

Pith reviewed 2026-06-27 17:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CRcs.RO

keywords safe reinforcement learningdata poisoningreinforcement unlearningoffline RLsafety constraintsdefense frameworkpolicy repair

0 comments

The pith

Safe-RULE removes the effects of poisoned data from offline safe reinforcement learning policies through targeted unlearning that preserves both performance and safety constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces safe reinforcement unlearning as a defense against data poisoning in offline Safe RL, where static datasets can be corrupted to produce unsafe policies. It shows that unlearning can excise the influence of malicious samples without restarting training or revisiting the original environment. The method explicitly balances task reward and safety limits during the unlearning step itself. Experiments on standard Safe RL benchmarks indicate improved safety metrics after unlearning compared with poisoned baselines.

Core claim

Safe-RULE is a reinforcement unlearning procedure that, given a poisoned offline dataset and a trained policy, produces a new policy whose behavior satisfies safety constraints and retains task performance by explicitly penalizing retention of poisoned-sample effects during the unlearning updates.

What carries the argument

The Safe-RULE unlearning objective that jointly optimizes task performance and safety-constraint satisfaction while erasing poisoned-sample influence.

If this is right

Offline Safe RL agents can be repaired after poisoning without access to the live environment.
Safety-critical policies trained on static datasets become more robust to training-time data attacks.
Unlearning can serve as a post-training defense layer that does not require re-collecting clean data.
The same unlearning loop can be applied whenever new poisoned samples are detected in an existing dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on datasets that contain both poisoned and naturally occurring unsafe trajectories to check whether unlearning distinguishes the two.
If the unlearning step proves stable across different constraint formulations, it might generalize to other constrained learning settings beyond RL.
Integration with continual-learning pipelines would allow periodic unlearning passes as new data arrives.

Load-bearing premise

Poisoned-sample effects can be isolated and removed by unlearning while still keeping both reward and safety performance intact without the original training environment or full retraining.

What would settle it

A controlled experiment in which, after Safe-RULE is applied to a known poisoned dataset, the resulting policy either violates the original safety constraints or shows substantially lower task return than the unpoisoned baseline.

Figures

Figures reproduced from arXiv: 2606.09559 by Fanxin Kong, Shixiong Jiang, Taozheng Zhu.

**Figure 3.** Figure 3: Evaluation cost and reward under the Min Reward poisoning attack. Although this attack [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

read the original abstract

Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotics systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversaries inject malicious samples that compromise safety and induce unsafe policy behavior. In this work, we propose a new learning paradigm, named safe reinforcement unlearning (Safe-RULE), used as a defense framework to remove the influence of poisoned data without retraining from scratch or requiring access to the original training environment. We further extend reinforcement unlearning to offline Safe RL by explicitly accounting for both task performance and safety constraints during the unlearning process. Experiments across benchmark Safe RL tasks demonstrate that our approach effectively enhances safety performance against data poisoning attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Safe-RULE is an idea for unlearning poisoned data in offline safe RL while preserving safety, but the abstract supplies no methods or results so the claim cannot be checked.

read the letter

The core pitch is a defense called Safe-RULE that removes the effect of poisoned samples from an offline safe RL dataset without retraining from scratch or needing the original environment, while still respecting both task reward and safety constraints.

What is actually new is the explicit extension of reinforcement unlearning ideas to the offline safe setting; most prior unlearning work has not had to carry safety constraints through the process. The paper correctly flags a real deployment risk: static datasets for robotics or other safety-critical systems can be attacked, and full retraining is often impractical.

The main weakness is that none of this is shown. The abstract states that experiments demonstrate effectiveness but gives no equations, no algorithm, no dataset details, no baselines, and no numbers. Without those, it is impossible to tell whether the unlearning step actually succeeds, whether safety is maintained, or whether the method collapses to something already known. The central feasibility claim—that you can excise poisoned influence while keeping both performance and constraint satisfaction—remains an assumption rather than a demonstrated result.

This is aimed at people working on safe RL and data poisoning. A reader gets almost no value from the current version because there is nothing technical to engage with. It does not deserve a serious referee right now; the work would need the full methods, proofs or derivations if any, and reproducible results before it is worth sending out.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Safe-RULE, a new learning paradigm for safe reinforcement unlearning in offline Safe RL. It functions as a defense framework to remove the influence of poisoned data without retraining from scratch or requiring access to the original training environment. The approach extends reinforcement unlearning to offline Safe RL by explicitly accounting for both task performance and safety constraints during unlearning. Experiments across benchmark Safe RL tasks are stated to demonstrate that the method effectively enhances safety performance against data poisoning attacks.

Significance. If the method can be shown to achieve the claimed unlearning while preserving both performance and safety constraints, the work would address an important practical vulnerability in offline Safe RL for safety-critical applications such as robotics, offering an alternative to full retraining.

major comments (1)

[Abstract] Abstract: The claim that 'Experiments across benchmark Safe RL tasks demonstrate that our approach effectively enhances safety performance against data poisoning attacks' is made without any methods, equations, data details, or results. This absence is load-bearing because the central claim of effective unlearning without original-environment access or full retraining rests entirely on the (unshown) empirical support.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and the opportunity to clarify the role of the abstract. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that 'Experiments across benchmark Safe RL tasks demonstrate that our approach effectively enhances safety performance against data poisoning attacks' is made without any methods, equations, data details, or results. This absence is load-bearing because the central claim of effective unlearning without original-environment access or full retraining rests entirely on the (unshown) empirical support.

Authors: Abstracts are intentionally concise high-level summaries and do not contain methods, equations, or results; those elements appear in the main manuscript. The Safe-RULE formulation, including the objective that jointly optimizes task return and safety constraint satisfaction during unlearning, is derived in Section 3. The practical algorithm that performs the unlearning step without access to the original environment is given in Section 4 together with the relevant update rules. Section 5 specifies the benchmark environments, the data-poisoning attack model, the evaluation metrics (including safety violation rate and task return), and reports quantitative results showing that the unlearned policies recover safety performance while preserving task return. Because the empirical support is fully documented in the body of the paper, the abstract claim is not unsupported. revision: no

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces Safe-RULE as a new paradigm for safe reinforcement unlearning in offline Safe RL to counter data poisoning. The abstract and description present a methodological proposal and experimental validation without any visible equations, derivations, parameter fittings, or self-citations that reduce claims to inputs by construction. No load-bearing steps match the enumerated circularity patterns (self-definitional, fitted predictions, etc.). The work is self-contained as a proposal with external benchmarks via experiments, yielding a normal non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted or evaluated.

pith-pipeline@v0.9.1-grok · 5657 in / 1048 out tokens · 19524 ms · 2026-06-27T17:26:30.986499+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 19 canonical work pages · 2 internal anchors

[1]

Sinha, A

S. Sinha, A. Mandlekar, and A. Garg. S4rl: Surprisingly simple self-supervision for offline reinforcement learning in robotics. InConference on Robot Learning, pages 907–917. PMLR, 2022

2022
[2]

J. Li, X. Liu, B. Zhu, J. Jiao, M. Tomizuka, C. Tang, and W. Zhan. Guided online distillation: Promoting safe reinforcement learning by offline demonstration. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7447–7454. IEEE, 2024

2024
[3]

T. Shi, D. Chen, K. Chen, and Z. Li. Offline reinforcement learning for autonomous driving with safety and exploration enhancement.arXiv preprint arXiv:2110.07067, 2021

work page arXiv 2021
[4]

X. Fang, Q. Zhang, Y . Gao, and D. Zhao. Offline reinforcement learning for autonomous driving with real world driving data. In2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), pages 3417–3422. IEEE, 2022

2022
[5]

S. Gu, L. Yang, Y . Du, G. Chen, F. Walter, J. Wang, and A. Knoll. A review of safe reinforcement learning: Methods, theories and applications.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024
[6]

C. Gong, Z. Yang, Y . Bai, J. He, J. Shi, K. Li, A. Sinha, B. Xu, X. Hou, D. Lo, et al. Baffle: Hiding backdoors in offline reinforcement learning datasets. In2024 IEEE Symposium on Security and Privacy (SP), pages 2086–2104. IEEE, 2024

2086
[7]

C. Gong, Z. Yang, Y . Bai, J. He, J. Shi, K. Li, A. Sinha, B. Xu, X. Hou, D. Lo, et al. Baffle: Backdoor attack in offline reinforcement learning.arXiv preprint arXiv:2210.04688, 2022

work page arXiv 2022
[8]

D. Ye, T. Zhu, C. Zhu, D. Wang, K. Gao, Z. Shi, S. Shen, W. Zhou, and M. Xue. Reinforcement unlearning.arXiv preprint arXiv:2312.15910, 2023

work page arXiv 2023
[9]

Kiourti, K

P. Kiourti, K. Wardega, S. Jha, and W. Li. Trojdrl: evaluation of backdoor attacks on deep reinforcement learning. In2020 57th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2020

2020
[10]

Jiang, M

S. Jiang, M. Liu, and F. Kong. Backdoor attacks on safe reinforcement learning-enabled cyber– physical systems.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(11):4093–4104, 2024

2024
[11]

C. Gong, K. Li, J. Yao, and T. Wang. Trajdeleter: Enabling trajectory forgetting in offline reinforcement learning agents.arXiv preprint arXiv:2404.12530, 2024

work page arXiv 2024
[12]

X. Wu, J. Li, M. Xu, W. Dong, S. Wu, C. Bian, and D. Xiong. Depn: Detecting and editing privacy neurons in pretrained language models.arXiv preprint arXiv:2310.20138, 2023

work page arXiv 2023
[13]

Farrell, Y .-T

E. Farrell, Y .-T. Lau, and A. Conmy. Applying sparse autoencoders to unlearn knowledge in language models.arXiv preprint arXiv:2410.19278, 2024

work page arXiv 2024
[14]

Deeb and F

A. Deeb and F. Roger. Do unlearning methods remove information from language model weights?arXiv preprint arXiv:2410.08827, 2024

work page arXiv 2024
[15]

Sheshadri, A

A. Sheshadri, A. Ewart, P. Guo, A. Lynch, C. Wu, V . Hebbar, H. Sleight, A. C. Stickland, E. Perez, D. Hadfield-Menell, et al. Latent adversarial training improves robustness to persistent harmful behaviors in llms.arXiv preprint arXiv:2407.15549, 2024

work page arXiv 2024
[16]

Isonuma and I

M. Isonuma and I. Titov. Unlearning traces the influential training data of language models. arXiv preprint arXiv:2401.15241, 2024

work page arXiv 2024
[17]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022. 9

2022
[18]

À. P. Vidal, A. S. Johansen, M. N. Jahromi, S. Escalera, K. Nasrollahi, and T. B. Moeslund. Verifying machine unlearning with explainable ai. InInternational Conference on Pattern Recognition, pages 458–473. Springer, 2024

2024
[19]

N. Sula, A. Kumar, J. Hou, H. Wang, and R. Tourani. Silver linings in the shadows: Harnessing membership inference for machine unlearning.arXiv preprint arXiv:2407.00866, 2024

work page arXiv 2024
[20]

K. Gu, M. R. U. Rashid, N. Sultana, and S. Mehnaz. Second-order information matters: Revisiting machine unlearning for large language models.arXiv preprint arXiv:2403.10557, 2024

work page arXiv 2024
[21]

J. Jang, D. Yoon, S. Yang, S. Cha, M. Lee, L. Logeswaran, and M. Seo. Knowledge unlearning for mitigating privacy risks in language models. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14389–14408, 2023

2023
[22]

Editing Models with Task Arithmetic

G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi. Editing models with task arithmetic.arXiv preprint arXiv:2212.04089, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Zhang, J

J. Zhang, J. Liu, J. He, et al. Composing parameter-efficient modules with arithmetic operation. Advances in Neural Information Processing Systems, 36:12589–12610, 2023

2023
[24]

arXiv preprint arXiv:2408.06223 , year =

D. Huu-Tien, T.-T. Pham, H. Thanh-Tung, and N. Inoue. On effects of steering latent represen- tation for large language model unlearning.arXiv preprint arXiv:2408.06223, 2024

work page arXiv 2024
[25]

Rosati, J

D. Rosati, J. Wehner, K. Williams, L. Bartoszcze, R. Gonzales, S. Majumdar, H. Sajjad, F. Rudzicz, et al. Representation noising: A defence mechanism against harmful finetuning. Advances in Neural Information Processing Systems, 37:12636–12676, 2024

2024
[26]

N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.-K. Dombrowski, S. Goel, L. Phan, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Behzadan and A

V . Behzadan and A. Munir. Vulnerability of deep reinforcement learning to policy induction attacks. InMachine Learning and Data Mining in Pattern Recognition: 13th International Conference, MLDM 2017, New York, NY, USA, July 15-20, 2017, Proceedings 13, pages 262–275. Springer, 2017

2017
[28]

Huang and Q

Y . Huang and Q. Zhu. Deceptive reinforcement learning under adversarial manipulations on cost signals. InDecision and Game Theory for Security: 10th International Conference, GameSec 2019, Stockholm, Sweden, October 30–November 1, 2019, Proceedings 10, pages 217–237. Springer, 2019

2019
[29]

Rakhsha, G

A. Rakhsha, G. Radanovic, R. Devidze, X. Zhu, and A. Singla. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. InInternational Conference on Machine Learning, pages 7974–7984. PMLR, 2020

2020
[30]

Liu and L

G. Liu and L. Lai. Provably efficient black-box action poisoning attacks against reinforcement learning.Advances in Neural Information Processing Systems, 34:12400–12410, 2021

2021
[31]

Lin, Z.-W

Y .-C. Lin, Z.-W. Hong, Y .-H. Liao, M.-L. Shih, M.-Y . Liu, and M. Sun. Tactics of adversarial attack on deep reinforcement learning agents.arXiv preprint arXiv:1703.06748, 2017

work page arXiv 2017
[32]

Z. Liu, Z. Guo, Z. Cen, H. Zhang, J. Tan, B. Li, and D. Zhao. On the robustness of safe reinforcement learning under observational perturbations.arXiv preprint arXiv:2205.14691, 2022

work page arXiv 2022
[33]

W. Guo, G. Liu, Z. Zhou, and L. Wang. Pnact: Crafting backdoor attacks in safe reinforcement learning.arXiv preprint arXiv:2507.00485, 2025. 10

work page arXiv 2025
[34]

H. Xu, X. Zhan, and X. Zhu. Constraints penalized q-learning for safe offline reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8753–8760, 2022

2022
[35]

J. Lee, C. Paduraru, D. J. Mankowitz, N. Heess, D. Precup, K.-E. Kim, and A. Guez. Coptidice: Offline constrained reinforcement learning via stationary distribution correction estimation. arXiv preprint arXiv:2204.08957, 2022

work page arXiv 2022
[36]

J. Ji, B. Zhang, J. Zhou, X. Pan, W. Huang, R. Sun, Y . Geng, Y . Zhong, J. Dai, and Y . Yang. Safety gymnasium: A unified safe reinforcement learning benchmark.Advances in Neural Information Processing Systems, 36, 2023

2023
[37]

Z. Liu, Z. Guo, H. Lin, Y . Yao, J. Zhu, Z. Cen, H. Hu, W. Yu, T. Zhang, J. Tan, et al. Datasets and benchmarks for offline safe reinforcement learning.arXiv preprint arXiv:2306.09303, 2023

work page arXiv 2023
[38]

Fujimoto, D

S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without explo- ration. InInternational conference on machine learning, pages 2052–2062. PMLR, 2019

2052
[39]

Kumar, J

A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine. Stabilizing off-policy q-learning via bootstrapping error reduction.Advances in neural information processing systems, 32, 2019. 11 A Theoretical Analysis In this section, we provide a theoretical explanation for two design choices in Safe-RULE: the reward reference ¯Qr and the safety margin σ in the forg...

2019

[1] [1]

Sinha, A

S. Sinha, A. Mandlekar, and A. Garg. S4rl: Surprisingly simple self-supervision for offline reinforcement learning in robotics. InConference on Robot Learning, pages 907–917. PMLR, 2022

2022

[2] [2]

J. Li, X. Liu, B. Zhu, J. Jiao, M. Tomizuka, C. Tang, and W. Zhan. Guided online distillation: Promoting safe reinforcement learning by offline demonstration. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7447–7454. IEEE, 2024

2024

[3] [3]

T. Shi, D. Chen, K. Chen, and Z. Li. Offline reinforcement learning for autonomous driving with safety and exploration enhancement.arXiv preprint arXiv:2110.07067, 2021

work page arXiv 2021

[4] [4]

X. Fang, Q. Zhang, Y . Gao, and D. Zhao. Offline reinforcement learning for autonomous driving with real world driving data. In2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), pages 3417–3422. IEEE, 2022

2022

[5] [5]

S. Gu, L. Yang, Y . Du, G. Chen, F. Walter, J. Wang, and A. Knoll. A review of safe reinforcement learning: Methods, theories and applications.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024

[6] [6]

C. Gong, Z. Yang, Y . Bai, J. He, J. Shi, K. Li, A. Sinha, B. Xu, X. Hou, D. Lo, et al. Baffle: Hiding backdoors in offline reinforcement learning datasets. In2024 IEEE Symposium on Security and Privacy (SP), pages 2086–2104. IEEE, 2024

2086

[7] [7]

C. Gong, Z. Yang, Y . Bai, J. He, J. Shi, K. Li, A. Sinha, B. Xu, X. Hou, D. Lo, et al. Baffle: Backdoor attack in offline reinforcement learning.arXiv preprint arXiv:2210.04688, 2022

work page arXiv 2022

[8] [8]

D. Ye, T. Zhu, C. Zhu, D. Wang, K. Gao, Z. Shi, S. Shen, W. Zhou, and M. Xue. Reinforcement unlearning.arXiv preprint arXiv:2312.15910, 2023

work page arXiv 2023

[9] [9]

Kiourti, K

P. Kiourti, K. Wardega, S. Jha, and W. Li. Trojdrl: evaluation of backdoor attacks on deep reinforcement learning. In2020 57th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2020

2020

[10] [10]

Jiang, M

S. Jiang, M. Liu, and F. Kong. Backdoor attacks on safe reinforcement learning-enabled cyber– physical systems.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(11):4093–4104, 2024

2024

[11] [11]

C. Gong, K. Li, J. Yao, and T. Wang. Trajdeleter: Enabling trajectory forgetting in offline reinforcement learning agents.arXiv preprint arXiv:2404.12530, 2024

work page arXiv 2024

[12] [12]

X. Wu, J. Li, M. Xu, W. Dong, S. Wu, C. Bian, and D. Xiong. Depn: Detecting and editing privacy neurons in pretrained language models.arXiv preprint arXiv:2310.20138, 2023

work page arXiv 2023

[13] [13]

Farrell, Y .-T

E. Farrell, Y .-T. Lau, and A. Conmy. Applying sparse autoencoders to unlearn knowledge in language models.arXiv preprint arXiv:2410.19278, 2024

work page arXiv 2024

[14] [14]

Deeb and F

A. Deeb and F. Roger. Do unlearning methods remove information from language model weights?arXiv preprint arXiv:2410.08827, 2024

work page arXiv 2024

[15] [15]

Sheshadri, A

A. Sheshadri, A. Ewart, P. Guo, A. Lynch, C. Wu, V . Hebbar, H. Sleight, A. C. Stickland, E. Perez, D. Hadfield-Menell, et al. Latent adversarial training improves robustness to persistent harmful behaviors in llms.arXiv preprint arXiv:2407.15549, 2024

work page arXiv 2024

[16] [16]

Isonuma and I

M. Isonuma and I. Titov. Unlearning traces the influential training data of language models. arXiv preprint arXiv:2401.15241, 2024

work page arXiv 2024

[17] [17]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022. 9

2022

[18] [18]

À. P. Vidal, A. S. Johansen, M. N. Jahromi, S. Escalera, K. Nasrollahi, and T. B. Moeslund. Verifying machine unlearning with explainable ai. InInternational Conference on Pattern Recognition, pages 458–473. Springer, 2024

2024

[19] [19]

N. Sula, A. Kumar, J. Hou, H. Wang, and R. Tourani. Silver linings in the shadows: Harnessing membership inference for machine unlearning.arXiv preprint arXiv:2407.00866, 2024

work page arXiv 2024

[20] [20]

K. Gu, M. R. U. Rashid, N. Sultana, and S. Mehnaz. Second-order information matters: Revisiting machine unlearning for large language models.arXiv preprint arXiv:2403.10557, 2024

work page arXiv 2024

[21] [21]

J. Jang, D. Yoon, S. Yang, S. Cha, M. Lee, L. Logeswaran, and M. Seo. Knowledge unlearning for mitigating privacy risks in language models. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14389–14408, 2023

2023

[22] [22]

Editing Models with Task Arithmetic

G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi. Editing models with task arithmetic.arXiv preprint arXiv:2212.04089, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[23] [23]

Zhang, J

J. Zhang, J. Liu, J. He, et al. Composing parameter-efficient modules with arithmetic operation. Advances in Neural Information Processing Systems, 36:12589–12610, 2023

2023

[24] [24]

arXiv preprint arXiv:2408.06223 , year =

D. Huu-Tien, T.-T. Pham, H. Thanh-Tung, and N. Inoue. On effects of steering latent represen- tation for large language model unlearning.arXiv preprint arXiv:2408.06223, 2024

work page arXiv 2024

[25] [25]

Rosati, J

D. Rosati, J. Wehner, K. Williams, L. Bartoszcze, R. Gonzales, S. Majumdar, H. Sajjad, F. Rudzicz, et al. Representation noising: A defence mechanism against harmful finetuning. Advances in Neural Information Processing Systems, 37:12636–12676, 2024

2024

[26] [26]

N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.-K. Dombrowski, S. Goel, L. Phan, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Behzadan and A

V . Behzadan and A. Munir. Vulnerability of deep reinforcement learning to policy induction attacks. InMachine Learning and Data Mining in Pattern Recognition: 13th International Conference, MLDM 2017, New York, NY, USA, July 15-20, 2017, Proceedings 13, pages 262–275. Springer, 2017

2017

[28] [28]

Huang and Q

Y . Huang and Q. Zhu. Deceptive reinforcement learning under adversarial manipulations on cost signals. InDecision and Game Theory for Security: 10th International Conference, GameSec 2019, Stockholm, Sweden, October 30–November 1, 2019, Proceedings 10, pages 217–237. Springer, 2019

2019

[29] [29]

Rakhsha, G

A. Rakhsha, G. Radanovic, R. Devidze, X. Zhu, and A. Singla. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. InInternational Conference on Machine Learning, pages 7974–7984. PMLR, 2020

2020

[30] [30]

Liu and L

G. Liu and L. Lai. Provably efficient black-box action poisoning attacks against reinforcement learning.Advances in Neural Information Processing Systems, 34:12400–12410, 2021

2021

[31] [31]

Lin, Z.-W

Y .-C. Lin, Z.-W. Hong, Y .-H. Liao, M.-L. Shih, M.-Y . Liu, and M. Sun. Tactics of adversarial attack on deep reinforcement learning agents.arXiv preprint arXiv:1703.06748, 2017

work page arXiv 2017

[32] [32]

Z. Liu, Z. Guo, Z. Cen, H. Zhang, J. Tan, B. Li, and D. Zhao. On the robustness of safe reinforcement learning under observational perturbations.arXiv preprint arXiv:2205.14691, 2022

work page arXiv 2022

[33] [33]

W. Guo, G. Liu, Z. Zhou, and L. Wang. Pnact: Crafting backdoor attacks in safe reinforcement learning.arXiv preprint arXiv:2507.00485, 2025. 10

work page arXiv 2025

[34] [34]

H. Xu, X. Zhan, and X. Zhu. Constraints penalized q-learning for safe offline reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8753–8760, 2022

2022

[35] [35]

J. Lee, C. Paduraru, D. J. Mankowitz, N. Heess, D. Precup, K.-E. Kim, and A. Guez. Coptidice: Offline constrained reinforcement learning via stationary distribution correction estimation. arXiv preprint arXiv:2204.08957, 2022

work page arXiv 2022

[36] [36]

J. Ji, B. Zhang, J. Zhou, X. Pan, W. Huang, R. Sun, Y . Geng, Y . Zhong, J. Dai, and Y . Yang. Safety gymnasium: A unified safe reinforcement learning benchmark.Advances in Neural Information Processing Systems, 36, 2023

2023

[37] [37]

Z. Liu, Z. Guo, H. Lin, Y . Yao, J. Zhu, Z. Cen, H. Hu, W. Yu, T. Zhang, J. Tan, et al. Datasets and benchmarks for offline safe reinforcement learning.arXiv preprint arXiv:2306.09303, 2023

work page arXiv 2023

[38] [38]

Fujimoto, D

S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without explo- ration. InInternational conference on machine learning, pages 2052–2062. PMLR, 2019

2052

[39] [39]

Kumar, J

A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine. Stabilizing off-policy q-learning via bootstrapping error reduction.Advances in neural information processing systems, 32, 2019. 11 A Theoretical Analysis In this section, we provide a theoretical explanation for two design choices in Safe-RULE: the reward reference ¯Qr and the safety margin σ in the forg...

2019