Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning
Pith reviewed 2026-05-18 20:36 UTC · model grok-4.3
The pith
TokenBuncher shields LLMs from harmful RL fine-tuning by suppressing response entropy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that RL-based harmful fine-tuning creates a greater systemic risk than SFT approaches, and that constraining model response entropy through entropy-as-reward RL plus a Token Noiser mechanism stops RL from exploiting distinct reward signals to develop harmful capabilities. This defense robustly mitigates the threat while preserving benign task performance and finetunability.
What carries the argument
TokenBuncher, which suppresses model response entropy via entropy-as-reward RL combined with a Token Noiser mechanism that blocks escalation of harmful capabilities.
Load-bearing premise
Suppressing model response entropy prevents RL from exploiting distinct reward signals to drive the model toward harmful behaviors.
What would settle it
An experiment in which RL fine-tuning applied to a TokenBuncher-protected model still produces high success rates on harmful tasks would show the defense fails to constrain the intended signals.
Original abstract
As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that reinforcement learning (RL) enables adversaries to more effectively break safety alignment and facilitate more advanced harmful task assistance, under matched computational budgets. To counter this emerging threat, we propose TokenBuncher, the first effective defense specifically targeting RL-based harmful fine-tuning. TokenBuncher suppresses the foundation on which RL relies: model response entropy. By constraining entropy, RL-based fine-tuning can no longer exploit distinct reward signals to drive the model toward harmful behaviors. We realize this defense through entropy-as-reward RL and a Token Noiser mechanism designed to prevent the escalation of harmful capabilities. Extensive experiments across multiple models and RL algorithms show that TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task performance and finetunability. Our results highlight that RL-based harmful fine-tuning poses a greater systemic risk than SFT, and that TokenBuncher provides an effective and general defense.
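To make the quantity named in the abstract concrete, the sketch below computes the mean per-token Shannon entropy of a response from its next-token logits and negates it to obtain a scalar reward. This is only an illustration of what an entropy-based reward could look like, under the assumption that entropy is averaged over decoding steps; it is not the paper's objective. The names `token_entropy` and `entropy_reward` and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def entropy_reward(response_logits):
    """Hypothetical reward: negative mean per-token entropy of one response."""
    entropies = [token_entropy(step) for step in response_logits]
    return -sum(entropies) / len(entropies)

# Toy responses over a 4-token vocabulary, three decoding steps each.
peaked = [[10.0, 0.0, 0.0, 0.0]] * 3  # near-deterministic distribution
flat = [[0.0, 0.0, 0.0, 0.0]] * 3     # uniform distribution

print(entropy_reward(peaked), entropy_reward(flat))
```

Under this sketch the near-deterministic response earns the higher reward (closer to zero) and the uniform response earns the lowest possible reward for a 4-token vocabulary, which is the direction an entropy-suppressing objective would push in.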
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that RL-based fine-tuning is a more effective threat to LLM safety alignment than SFT under matched compute budgets, and proposes TokenBuncher as the first targeted defense. The defense suppresses model response entropy via an entropy-as-reward RL objective combined with a Token Noiser mechanism, with the goal of preventing RL from exploiting distinct reward signals for harmful behaviors. Extensive experiments across multiple models and RL algorithms are presented to support the claim that TokenBuncher mitigates harmful RL fine-tuning while preserving benign task performance and finetunability.
Significance. If substantiated, the work would be significant for LLM safety by identifying RL as a systemic risk vector beyond SFT and introducing a defense grounded in entropy control. Credit is due for the broad experimental scope across models and RL algorithms (PPO, GRPO, etc.), which strengthens the empirical case relative to narrower prior studies. The focus on preserving finetunability is a practical strength.
major comments (2)
- [Abstract] Abstract: the claim that 'constraining entropy' prevents RL from exploiting distinct reward signals is load-bearing for the defense but is not demonstrated against adaptive attackers. Algorithms such as PPO can perform policy-gradient updates on low-entropy (near-deterministic) policies whenever the advantage for the sampled action is positive; no experiments test robustness to reward shaping, increased samples per update, or deterministic-policy variants under matched compute.
- [Defense mechanism] Defense mechanism description: the entropy constraint is presented as a direct mechanistic intervention, yet the entropy-as-reward formulation risks reducing to the input distribution by construction (as flagged by the circularity concern). The manuscript must clarify how the constraint is enforced without tautology and whether it remains effective once the attacker observes the defended distribution.
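The first major comment turns on how a policy-gradient update behaves when the policy is nearly deterministic. The sketch below works this out for a toy softmax policy over four actions, assuming the gradient is taken with respect to the logits, where the score function is `onehot(action) - probs`. It is an illustration of the referee's argument, not an experiment from the paper; all function names and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_norm(probs, action):
    """L2 norm of d log pi(action) / d logits = onehot(action) - probs."""
    return math.sqrt(sum(
        ((1.0 if i == action else 0.0) - p) ** 2
        for i, p in enumerate(probs)
    ))

def expected_sq_score_norm(probs):
    """E_a ||onehot(a) - probs||^2, which simplifies to 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in probs)

def prob_non_mode_sampled(probs, mode, n_samples):
    """Chance that at least one of n i.i.d. samples is not the mode."""
    return 1.0 - probs[mode] ** n_samples

# Toy low-entropy policy: action 0 carries almost all probability mass.
probs = softmax([8.0, 0.0, 0.0, 0.0])

print(score_norm(probs, 0))            # sampled action is the mode
print(score_norm(probs, 1))            # sampled action is a rare one
print(expected_sq_score_norm(probs))   # average over the policy
print(prob_non_mode_sampled(probs, 0, 1000))
```

In this toy setting the expected squared score norm shrinks as the policy sharpens, which matches the intuition behind the defense. The score for a rarely sampled action stays large, however, and the chance of drawing such an action grows with the number of samples per update, which is the gap the referee asks the authors to test.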
minor comments (2)
- [Experiments] Experiments section: error bars, standard deviations, or statistical tests are not visible in the reported results despite the claim of 'extensive experiments'; adding them would allow readers to assess variability across runs and models.
- [Methods] Methods: the Token Noiser mechanism is introduced but lacks explicit pseudocode, hyperparameter ranges, or ablation isolating its contribution from the entropy-as-reward term, hindering reproducibility.
Simulated Author's Rebuttal
We appreciate the referee's thorough and constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation and address the concerns raised.
- Referee: [Abstract] Abstract: the claim that 'constraining entropy' prevents RL from exploiting distinct reward signals is load-bearing for the defense but is not demonstrated against adaptive attackers. Algorithms such as PPO can perform policy-gradient updates on low-entropy (near-deterministic) policies whenever the advantage for the sampled action is positive; no experiments test robustness to reward shaping, increased samples per update, or deterministic-policy variants under matched compute.
  Authors: We agree that robustness to adaptive attackers is an important consideration for a load-bearing claim. Our experiments already cover PPO (and GRPO) under matched compute budgets across multiple models, showing that TokenBuncher substantially reduces the success of harmful RL fine-tuning while preserving benign performance. However, we did not include explicit ablations on reward shaping, increased samples per update, or deterministic-policy variants. We will add a dedicated limitations and future-work subsection discussing these adaptive scenarios and their potential impact on the entropy constraint, along with any supporting analysis that can be derived from existing runs.
  Revision: partial
- Referee: [Defense mechanism] Defense mechanism description: the entropy constraint is presented as a direct mechanistic intervention, yet the entropy-as-reward formulation risks reducing to the input distribution by construction (as flagged by the circularity concern). The manuscript must clarify how the constraint is enforced without tautology and whether it remains effective once the attacker observes the defended distribution.
  Authors: The entropy-as-reward objective is applied only during the one-time defense training phase on the base model; it is not recomputed or dependent on the attacker's subsequent fine-tuning objective, avoiding circularity. The Token Noiser is the key non-tautological component: it injects controlled stochasticity into the token distribution at inference time, preventing collapse to the input distribution while still lowering entropy on harmful trajectories. Experiments already demonstrate that the defended model remains resistant to RL fine-tuning even when the attacker has full access to its outputs (i.e., the defended distribution). We will revise the defense-mechanism section to explicitly separate the training-time entropy reward from the inference-time noiser and to add a paragraph addressing the circularity concern and post-observation robustness.
  Revision: yes
Circularity Check
No significant circularity in the derivation chain
The paper's core claim—that constraining model response entropy prevents RL from exploiting distinct reward signals for harmful behaviors—is presented as a mechanistic design principle realized via entropy-as-reward RL and the Token Noiser. This is not derived from equations or parameters that reduce to the input by construction, nor does it rely on load-bearing self-citations or uniqueness theorems from prior author work. The demonstration of RL's greater effectiveness over SFT and the defense's robustness are supported by experiments across models and algorithms, which constitute independent empirical content rather than a statistical fit renamed as prediction. The derivation chain remains self-contained against external benchmarks of RL optimization and task performance.
Axiom & Free-Parameter Ledger
free parameters (1)
- entropy constraint strength
axioms (1)
- domain assumption: RL fine-tuning relies on high-entropy responses to discover distinct reward signals for harmful behaviors
invented entities (1)
- Token Noiser: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: TOKEN BUNCHER suppresses the foundation on which RL relies: model response uncertainty. By constraining uncertainty, RL-based fine-tuning can no longer exploit distinct reward signals...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.