Recognition: unknown
A Systematic Study of Training-Free Methods for Trustworthy Large Language Models
Pith reviewed 2026-05-10 08:16 UTC · model grok-4.3
The pith
Training-free methods to make large language models trustworthy show clear trade-offs in utility, robustness, and cost depending on where they intervene.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing training-free methods can be organized into input-level, internal-level, and output-level interventions in the inference process. Comprehensive tests across model families and sizes reveal that these methods improve selected trustworthiness properties but frequently reduce utility, increase brittleness to adversarial inputs, and add computational overhead, with different levels producing distinct patterns of gains and losses.
What carries the argument
The three-level taxonomy of intervention points during inference: input modifications, internal state changes, and output post-processing.
If this is right
- Input-level methods tend to be low-cost but offer shallower safety improvements than internal or output methods.
- Internal interventions often carry higher computational cost and can introduce new brittleness.
- Output-level fixes are easy to apply yet leave the model vulnerable to attacks on earlier stages.
- No single level covers every trustworthiness dimension, so deployment choices must weigh specific risks against performance drops.
- Balancing trustworthiness with utility and robustness requires explicit testing rather than assuming safety gains come for free.
Where Pith is reading between the lines
- Hybrid approaches that combine two levels could offset the weaknesses each level shows in isolation.
- The observed patterns suggest trustworthiness fixes may need to be tuned per model family even without retraining.
- Extending similar tests to multimodal or agentic systems would clarify whether the same level-based trade-offs persist.
- Practitioners should include adversarial robustness checks as a standard part of adopting any training-free method.
Load-bearing premise
The chosen representative methods, trustworthiness tasks, and model families are broad enough to reveal the general trade-offs that would appear in the full range of training-free techniques and deployment conditions.
What would settle it
A training-free method that raises all measured trustworthiness scores while leaving utility, robustness to attacks, and runtime unchanged across several model sizes and families would falsify the reported trade-offs.
Figures
read the original abstract
As Large Language Models (LLMs) receive increasing attention and are being deployed across various domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-free methods have recently emerged as cost-effective alternatives to post-training alignment techniques. Despite their promising results, these methods are evaluated inconsistently across the literature, cover limited dimensions of trustworthiness, and can introduce undesirable side effects, such as utility degradation and increased brittleness. To fully assess the impacts of these training-free methods, we take a step back and systematically re-evaluate the effectiveness of existing training-free methods against various trustworthy settings and their influence on utility, robustness, and computational overhead. We also categorize these methods into three levels (input, internal, and output) based on where they intervene in the model's information flow during inference. Using this taxonomy, we conduct a comprehensive analysis of various representative and effective methods from each level across different LLM families and sizes. Our analysis highlights several trade-offs and unresolved challenges in current approaches. We summarize key findings and limitations in the existing literature, and propose practical recommendations for balancing trustworthiness, utility, and robustness in LLMs without the need for additional training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a three-level taxonomy (input, internal, output) for training-free methods aimed at enhancing the trustworthiness of large language models. Through this taxonomy, it conducts a systematic re-evaluation of representative methods across various LLM families and sizes, examining their effects on trustworthiness metrics, utility, robustness, and computational costs, while identifying key trade-offs and unresolved challenges, and offering practical recommendations for balancing these aspects without additional training.
Significance. Should the analysis prove robust and comprehensive, the paper's taxonomy and highlighted trade-offs could serve as a valuable reference for researchers and practitioners working on trustworthy AI, emphasizing the need to consider side effects like utility degradation and brittleness in training-free interventions.
major comments (1)
- [Abstract] Abstract: The abstract outlines the taxonomy and comprehensive analysis but provides no specifics on the benchmarks employed, statistical methods, data exclusion rules, or controls for side effects. This omission hinders verification of the robustness of the reported trade-offs in trustworthiness, utility, and robustness.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the single major comment below and describe the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The abstract outlines the taxonomy and comprehensive analysis but provides no specifics on the benchmarks employed, statistical methods, data exclusion rules, or controls for side effects. This omission hinders verification of the robustness of the reported trade-offs in trustworthiness, utility, and robustness.
Authors: We agree that the abstract would benefit from additional high-level details to improve verifiability. In the revised version we will expand the abstract to concisely note the evaluation benchmarks (trustworthiness, utility, and robustness suites), the statistical procedures (multi-seed averaging with variance reporting), and the controls for side effects (joint measurement of utility degradation, attack robustness, and overhead). Full specifications of data exclusion criteria and experimental protocols remain in Sections 3–5 of the main text; the abstract revision will summarize these without exceeding length limits. revision: yes
Circularity Check
No significant circularity
full rationale
This is an empirical survey paper that introduces a taxonomy of training-free methods (input/internal/output levels) and re-evaluates representative methods across LLMs and trustworthiness settings. No derivations, equations, fitted parameters, or predictions appear in the provided text; all claims rest on external benchmarks, cited prior methods, and observed trade-offs rather than any self-referential reduction or self-citation chain that bears the central load. The analysis is self-contained against external evidence.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Existing benchmarks and settings adequately capture the key dimensions of trustworthiness, utility, robustness, and computational overhead for LLMs.
Reference graph
Works this paper leans on
-
[1]
Claude.https://claude.ai/
Anthropic. Claude.https://claude.ai/. 2
-
[2]
arXiv preprint arXiv:2503.00177 , year=
Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, and Pascal Vincent. Steering large language model activations in sparse spaces.CoRR, abs/2503.00177,
-
[3]
Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, and Christopher Parisien. Towards inference-time category- wise safety steering for large language models.CoRR, abs/2410.01174, 2024. 4, 6
-
[4]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar- wal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz L...
2020
-
[5]
SCANS: mitigating the exaggerated safety for llms via safety-conscious activation steering
Zouying Cao, Yifei Yang, and Hai Zhao. SCANS: mitigating the exaggerated safety for llms via safety-conscious activation steering. In Toby Walsh, Julie Shah, and Zico Kolter, edi- tors,AAAI-25, Sponsored by the Association for the Advance- ment of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, pages 23523–23531. AAAI Press,
2025
-
[6]
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking Black Box Large Language Models in Twenty Queries.CoRR abs/2310.08419, 2023. 1
work page internal anchor Pith review arXiv 2023
-
[7]
Christiano, Jan Leike, Tom B
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Mar- tic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences. InAnnual Conference on Neural Information Processing Systems (NIPS), pages 4299–
-
[8]
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs. InAn- nual Meeting of the Association for Computational Linguistics (ACL), pages 21538–21566. ACL, 2025. 1
2025
-
[9]
Glass, and Pengcheng He
Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R. Glass, and Pengcheng He. Dola: Decoding by con- trasting layers improves factuality in large language models. InThe Twelfth International Conference on Learning Rep- resentations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. 1, 4, 7, 9
2024
-
[10]
Scalable Wa- termarking for Identifying Large Language Model Outputs
Sumanth Dathathri, Abigail See, Sumedh Ghaisas, Po-Sen Huang, Rob McAdam, Johannes Welbl, Vandana Bachani, Alex Kaskasoli, Robert Stanforth, Tatiana Matejovicova, Jamie Hayes, Nidhi Vyas, Majd Al Merey, Jonah Brown- Cohen, Rudy Bunel, Borja Balle, Taylan Cemgil, Zahra Ahmed, Kitty Stacpoole, Ilia Shumailov, Ciprian Baetu, Sven Gowal, Demis Hassabis, and P...
2024
-
[11]
Why not act on what you know? unleashing safety potential of llms via self-aware guard enhancement
Peng Ding, Jun Kuang, ZongYu Wang, Xuezhi Cao, Xunliang Cai, Jiajun Chen, and Shujian Huang. Why not act on what you know? unleashing safety potential of llms via self-aware guard enhancement. In Wanxiang Che, Joyce Nabende, Eka- terina Shutova, and Mohammad Taher Pilehvar, editors,Find- ings of the Association for Computational Linguistics, ACL 2025, Vie...
2025
-
[12]
Association for Computational Linguistics, 2025. 4, 5, 9
2025
-
[13]
Publicly Detectable Watermarking for Language Models.IACR Cryp- tology ePrint Archive, 2023
Jaiden Fairoze, Sanjam Garg, Somesh Jha, Saeed Mahlouji- far, Mohammad Mahmoody, and Mingyuan Wang. Publicly Detectable Watermarking for Language Models.IACR Cryp- tology ePrint Archive, 2023. 4
2023
-
[14]
Aryo Pradipta Gema, Chen Jin, Ahmed Abdulaal, Tom Diethe, Philip Teare, Beatrice Alex, Pasquale Minervini, and Amrutha Saseendran. Decore: Decoding by con- trasting retrieval heads to mitigate hallucinations.CoRR, abs/2410.18860, 2024. 4, 7
-
[15]
Measur- ing Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measur- ing Massive Multitask Language Understanding. InInterna- tional Conference on Learning Representations (ICLR), 2021. 8
2021
-
[16]
The Curious Case of Neural Text Degeneration
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The Curious Case of Neural Text Degeneration. InIn- ternational Conference on Learning Representations (ICLR),
-
[17]
Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. InInternational Conference on Learning Representations (ICLR), 2022. 2
2022
-
[18]
Huang, Sailik Sengupta, Daniele Bonadiman, Yi-An Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, and Dan Roth
James Y . Huang, Sailik Sengupta, Daniele Bonadiman, Yi-An Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, and Dan Roth. Deal: Decoding-time alignment for large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computat...
2025
-
[19]
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
Heegyu Kim, Sehyun Yuk, and Hyunsouk Cho. Break the breakout: Reinventing LM defense against jailbreak attacks with self-refinement.CoRR, abs/2402.15180, 2024. 4, 5
-
[20]
A Watermark for Large Language Models
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A Watermark for Large Language Models. InInternational Conference on Machine Learning (ICML), pages 17061–17084. PMLR, 2023. 8
2023
-
[21]
Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre L
Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre L. Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activa- tion steering. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24- 28, 2025. OpenReview.net, 2025. 4, 6
2025
-
[22]
Who Wrote this Code? Watermarking for Code Generation
Taehyun Lee, Seokhee Hong, Jaewoo Ahn, Ilgee Hong, Hwaran Lee, Sangdoo Yun, Jamin Shin, and Gunhee Kim. Who Wrote this Code? Watermarking for Code Generation. InAnnual Meeting of the Association for Computational Lin- guistics (ACL), pages 4890–4911. ACL, 2024. 4
2024
-
[23]
HaluEval: A Large-Scale Hallucination Eval- 14 uation Benchmark for Large Language Models
Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji- Rong Wen. HaluEval: A Large-Scale Hallucination Eval- 14 uation Benchmark for Large Language Models. InConfer- ence on Empirical Methods in Natural Language Processing (EMNLP), pages 6449–6464. ACL, 2023. 1
2023
-
[24]
PLMmark: A Secure and Robust Black-Box Watermarking Framework for Pre-trained Lan- guage Models
Peixuan Li, Pengzhou Cheng, Fangqi Li, Wei Du, Haodong Zhao, and Gongshen Liu. PLMmark: A Secure and Robust Black-Box Watermarking Framework for Pre-trained Lan- guage Models. InAAAI Conference on Artificial Intelligence (AAAI), pages 14991–14999. AAAI, 2023. 4
2023
-
[25]
RAIN: your language models can align themselves without finetuning
Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. RAIN: your language models can align themselves without finetuning. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. 4, 7
2024
-
[26]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods. InAn- nual Meeting of the Association for Computational Linguistics (ACL), pages 3214–3252. ACL, 2022. 8, 9
2022
-
[27]
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Au- toDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.CoRR abs/2310.04451, 2023. 1, 8
work page internal anchor Pith review arXiv 2023
-
[28]
Jailbreaking chat- gpt via prompt engineering: An empirical study
Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jail- breaking ChatGPT via Prompt Engineering: An Empirical Study.CoRR abs/2305.13860, 2023. 1
-
[29]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zi- fan Wang, Norman Mu, Elham Sakhaee, athaniel Li, Steven Basart, Bo Li, David A. Forsyth, and Dan Hendrycks. Harm- Bench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal.CoRR abs/abs/2402.04249,
work page internal anchor Pith review arXiv
-
[30]
Andonian, Yonatan Belinkov, and David Bau
Kevin Meng, Arnab Sen Sharma, Alex J. Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a trans- former. InThe Eleventh International Conference on Learn- ing Representations, ICLR 2023, Kigali, Rwanda, May 1-5,
2023
-
[31]
OpenReview.net, 2023. 2
2023
-
[32]
Rethink- ing the Role of Demonstrations: What Makes In-Context Learning Work? InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 11048–11064
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethink- ing the Role of Demonstrations: What Makes In-Context Learning Work? InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 11048–11064. ACL, 2022. 5
2022
-
[33]
GPT-4o.https://openai.com/index/hello- gpt-4o/, 2024
OpenAI. GPT-4o.https://openai.com/index/hello- gpt-4o/, 2024. 2
2024
-
[34]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human f...
2022
-
[35]
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Pad- makumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. BBQ: A hand-built bias benchmark for question answering. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 20...
2022
-
[36]
Red Teaming Language Models with Language Models
Ethan Perez, Saffron Huang, H. Francis Song, Trevor Cai, Ro- man Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red Teaming Language Models with Language Models.CoRR abs/2202.03286, 2022. 1
work page Pith review arXiv 2022
-
[37]
LLM self defense: By self examination, llms know they are being tricked
Mansi Phute, Alec Helbling, Matthew Hull, Shengyun Peng, Sebastian Szyller, Cory Cornelius, and Duen Horng Chau. LLM self defense: By self examination, llms know they are being tricked. InThe Second Tiny Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, May 11, 2024. OpenReview.net, 2024. 4, 5
2024
-
[38]
Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, and Shay B. Cohen. Spectral editing of activations for large language model alignment.CoRR, abs/2405.09719, 2024. 4, 6, 9
-
[39]
Steering llama 2 via contrastive activation addition
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition. In Lun-Wei Ku, Andre Mar- tins, and Vivek Srikumar, editors,Proceedings of the 62nd An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, Au- ...
2024
-
[40]
Xstest: A test suite for identifying exaggerated safety behaviours in large lan- guage models
Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attana- sio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large lan- guage models. In Kevin Duh, Helena Gómez-Adorno, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Compu- tationa...
2024
-
[41]
Jailbreak antidote: Runtime safety-utility balance via sparse representation adjustment in large language mod- els
Guobin Shen, Dongcheng Zhao, Yiting Dong, Xiang He, and Yi Zeng. Jailbreak antidote: Runtime safety-utility balance via sparse representation adjustment in large language mod- els. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. 4, 6
2025
-
[42]
Do Anything Now: Characterizing and Evaluat- ing In-The-Wild Jailbreak Prompts on Large Language Mod- els
Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. Do Anything Now: Characterizing and Evaluat- ing In-The-Wild Jailbreak Prompts on Large Language Mod- els. InACM SIGSAC Conference on Computer and Commu- nications Security (CCS). ACM, 2024. 1
2024
-
[43]
Navigating the overkill in large language models
Chenyu Shi, Xiao Wang, Qiming Ge, Songyang Gao, Xianjun Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Xun Zhao, and Dahua Lin. Navigating the overkill in large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Ban...
2024
-
[44]
Association for Computational Linguistics, 2024. 4, 7
2024
-
[45]
Hashimoto
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An Instruction-following LLaMA Model.https://github.com/tatsu- lab/st anford_alpaca, 2023. 8
2023
-
[46]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Am- jad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cu- curull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman 15 Goyal, Anthon...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
Steering Language Models With Activation Engineering
Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation Addition: Steering Language Models Without Optimization. CoRR abs/2308.10248, 2023. 1, 4, 6
work page internal anchor Pith review arXiv 2023
-
[48]
Model editing as a robust and denoised variant of DPO: A case study on toxicity
Rheeya Uppaal, Apratim Dey, Yiting He, Yiqiao Zhong, and Junjie Hu. Model editing as a robust and denoised variant of DPO: A case study on toxicity. InThe Thirteenth Interna- tional Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. 1, 4, 6
2025
-
[49]
Decodingtrust: A comprehensive assessment of trustworthiness in gpt models
Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, and Bo Li. DecodingTrust: A Compre- hensive Assessment of Trustworthiness in GPT Models.CoRR abs/2306.11698, 2023. 1
-
[50]
Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xing- hao Wang, Ke Ren, Botian Jiang, and Xipeng Qiu. Infer- aligner: Inference-time alignment for harmlessness through cross-model guidance.CoRR, abs/2401.11206, 2024. 4, 6
-
[51]
Surgical, cheap, and flexible: Mitigating false refusal in language models via single vector ablation
Xinpeng Wang, Chengzhi Hu, Paul Röttger, and Barbara Plank. Surgical, cheap, and flexible: Mitigating false refusal in language models via single vector ablation. InThe Thir- teenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net,
2025
-
[52]
Larger language models do in-context learning differently
Jerry W. Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Web- son, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. Larger language models do in-context learning differently.CoRR abs/2303.03846, 2023. 5
-
[53]
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, and Yisen Wang. Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations.CoRR abs/2310.06387,
-
[54]
Enhancing multiple dimensions of trustworthiness in llms via sparse acti- vation control
Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, and Jieping Ye. Enhancing multiple dimensions of trustworthiness in llms via sparse acti- vation control. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Pro- ces...
2024
-
[55]
Defending ChatGPT against jailbreak attack via self-reminders.Nature Machine Intelligence, 2023
Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending ChatGPT against jailbreak attack via self-reminders.Nature Machine Intelligence, 2023. 1, 4, 5
2023
-
[56]
Yue Zhang, Leyang Cui, Wei Bi, and Shuming Shi. Alleviat- ing hallucinations of large language models through induced hallucinations.CoRR, abs/2312.15710, 2023. 1, 4, 7
-
[57]
In- tention analysis makes llms A good jailbreak defender
Yuqi Zhang, Liang Ding, Lefei Zhang, and Dacheng Tao. In- tention analysis makes llms A good jailbreak defender. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al- Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors, Proceedings of the 31st International Conference on Compu- tational Linguistics, COLING 2025, Abu Dhabi, UAE, Jan- uary 19-24...
2025
-
[58]
Defending Large Language Mod- els Against Jailbreaking Attacks Through Goal Prioritization
Zhexin Zhang, Junxiao Yang, Pei Ke, Fei Mi, Hongning Wang, and Minlie Huang. Defending Large Language Mod- els Against Jailbreaking Attacks Through Goal Prioritization. InAnnual Meeting of the Association for Computational Lin- guistics (ACL), pages 8865–8887. ACL, 2024. 4, 5, 9
2024
-
[59]
Adasteer: Your aligned LLM is inher- ently an adaptive jailbreak defender.CoRR, abs/2504.09466,
Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, and Ting Liu. Adasteer: Your aligned LLM is inher- ently an adaptive jailbreak defender.CoRR, abs/2504.09466,
-
[60]
Provable Robust Watermarking for AI- Generated Text
Xuandong Zhao, Prabhanjan Vijendra Ananth, Lei Li, and Yu-Xiang Wang. Provable Robust Watermarking for AI- Generated Text. InInternational Conference on Learning Representations (ICLR). ICLR, 2024. 4
2024
-
[61]
Xing, Hao Zhang, Joseph E
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonza- lez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. InAnnual Conference on Neural Infor- mation Processing Systems (NeurIPS). NeurIPS, 2023. 8
2023
-
[62]
ROSE doesn’t do that: Boosting the safety of instruction- tuned large language models with reverse prompt contrastive decoding
Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. ROSE doesn’t do that: Boosting the safety of instruction- tuned large language models with reverse prompt contrastive decoding. In Lun-Wei Ku, Andre Martins, and Vivek Sriku- mar, editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meet- in...
2024
-
[63]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models.CoRR abs/2307.15043, 2023. 1, 3, 8 16 .1 Additional Experimental Details In this paper, we sample up to 1,000 test examples for TQA, BBQ, and MMLU, and 200 test examples for HB, XSTest, MT-Bench, and Alpaca. For HB, we use t...
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.