Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
Pith reviewed 2026-05-10 09:07 UTC · model grok-4.3
The pith
Pruning parameters tied to unsafe behaviors in language models reduces harmful generations and boosts resistance to jailbreak attacks with little loss in utility.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Even after alignment, language models retain unsafe tickets—specific parameter subsets that trigger harmful outputs. A gradient-free attribution procedure identifies these tickets and prunes them, revealing safety tickets that preserve general capabilities. The result is fewer unsafe generations, greater resistance to jailbreak prompts, and only minimal drops in utility across tested models and architectures.
What carries the argument
Gradient-free attribution to isolate unsafe tickets, then prune them to expose safety tickets, all framed inside the Lottery Ticket Hypothesis.
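The attribution-then-prune loop can be sketched in a few lines. The scoring rule below (mean activation gap between unsafe-eliciting and benign prompts) is an assumption chosen for illustration; the paper's exact gradient-free signal is not given in this summary, and all function names are hypothetical.

```python
# Hypothetical sketch of the attribution-then-prune loop. Gradient-free
# here means the score needs only forward-pass activations, no backprop.

def attribute_unsafe(acts_unsafe, acts_benign):
    """Score each unit by how much more it activates on unsafe prompts.

    acts_unsafe / acts_benign: lists of per-prompt activation vectors
    (one float per unit). Returns one score per unit.
    """
    n_units = len(acts_unsafe[0])
    scores = []
    for i in range(n_units):
        mean_u = sum(a[i] for a in acts_unsafe) / len(acts_unsafe)
        mean_b = sum(a[i] for a in acts_benign) / len(acts_benign)
        scores.append(mean_u - mean_b)
    return scores

def prune_unsafe_tickets(weights, scores, frac=0.1):
    """Zero the top-`frac` highest-scoring units (the "unsafe ticket");
    the surviving weights play the role of the "safety ticket"."""
    k = max(1, int(len(weights) * frac))
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    pruned = list(weights)
    for i in ranked[:k]:
        pruned[i] = 0.0
    return pruned
```

In a real model the "units" would be individual weights, neurons, or attention heads, and the activations would come from batched forward passes over a curated prompt set.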
If this is right
- Unsafe generations drop substantially on the evaluated models.
- Resistance to jailbreak attacks increases.
- Utility on general tasks remains nearly unchanged.
- The same procedure works on different model families and on quantized versions.
- Safety gains occur through a single post-training pass that uses limited hardware.
Where Pith is reading between the lines
- If the unsafe tickets turn out to be stable across training runs, future models could be built with explicit safety-ticket masks from the start.
- The same attribution-plus-prune loop might locate and remove other localized problems such as factual errors or biased patterns.
- Deployment pipelines could run this pruning step on-device for edge models where retraining is impossible.
Load-bearing premise
The attribution step correctly flags only the parameters that cause unsafe outputs and leaves the parameters needed for normal task performance untouched.
What would settle it
Apply the pruning to a model and measure two things: whether the rate of unsafe outputs stays flat or rises (which would undercut the safety claim), and whether accuracy on standard benchmarks falls sharply (which would undercut the utility claim).
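That falsification test reduces to a small harness. The judge and the utility-drop threshold below are placeholders, not the paper's actual evaluation protocol.

```python
# Minimal harness for the settling test: the claim survives only if the
# unsafe-output rate drops while benchmark accuracy holds. `is_unsafe`
# stands in for whatever safety judge the paper would specify.

def unsafe_rate(outputs, is_unsafe):
    """Fraction of generations the judge flags as unsafe."""
    return sum(1 for o in outputs if is_unsafe(o)) / len(outputs)

def claim_survives(unsafe_before, unsafe_after, util_before, util_after,
                   max_util_drop=0.02):
    """True iff pruning cut unsafe outputs without a sharp utility fall."""
    return (unsafe_after < unsafe_before
            and (util_before - util_after) <= max_util_drop)
```

The 2-point utility tolerance is an arbitrary illustrative choice; any published protocol would need to fix it in advance.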
Original abstract
Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain "unsafe tickets" responsible for harmful behaviors, and pruning reveals "safety tickets" that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resource-constrained settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a resource-efficient post-hoc pruning framework that uses a gradient-free attribution mechanism to identify and remove 'unsafe tickets' (subnetworks responsible for harmful outputs) from aligned LLMs such as Mistral and LLaVA. It claims this yields substantial reductions in unsafe generations, improved robustness to jailbreak attacks, and minimal utility loss while revealing 'safety tickets' consistent with an extension of the Lottery Ticket Hypothesis; the method is presented as generalizing across architectures and quantized variants with only modest compute.
Significance. If the central empirical claims hold with proper controls, the work would provide a lightweight, training-free alternative to RLHF/SFT for safety alignment that is attractive for resource-constrained deployment. The framing around unsafe/safety tickets and the reported generalization are potentially novel contributions, but the absence of quantitative metrics, baselines, and specificity tests in the current presentation limits the assessed impact.
major comments (3)
- [Method] The gradient-free attribution procedure is described at a high level but without an explicit scoring formula, definition of the 'unsafe' signal, or pseudocode; this is load-bearing because all downstream claims (safety gains, jailbreak robustness, minimal utility loss) rest on the attribution isolating behaviorally specific parameters rather than producing generic sparsity.
- [Experiments] No ablation against random pruning at matched sparsity is reported; without this control, observed reductions in unsafe outputs on Mistral/LLaVA could be explained by capacity reduction or incidental effects rather than targeted 'unsafe ticket' excision.
- [Results] The abstract and results sections assert 'substantial reductions in unsafe generations' and 'minimal utility loss' yet supply no concrete metrics, baselines (e.g., standard safety benchmarks, random or magnitude-based pruning), or measurement protocol for unsafe generations; this prevents evaluation of the data-to-claim link.
minor comments (2)
- [Introduction] Notation for 'unsafe tickets' and 'safety tickets' is introduced without a formal definition or reference to prior Lottery Ticket Hypothesis extensions; a brief clarifying paragraph would improve readability.
- [Figures/Tables] Figure captions and table headers should explicitly state the exact safety and utility metrics used (e.g., percentage of unsafe responses, specific benchmark names) to allow direct comparison with related work.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and valuable suggestions. We will revise the manuscript to address all the major comments by adding the requested details, controls, and metrics. Below we provide point-by-point responses.
Point-by-point responses
-
Referee: [Method] The gradient-free attribution procedure is described at a high level but without an explicit scoring formula, definition of the 'unsafe' signal, or pseudocode; this is load-bearing because all downstream claims (safety gains, jailbreak robustness, minimal utility loss) rest on the attribution isolating behaviorally specific parameters rather than producing generic sparsity.
Authors: We agree with the referee that more explicit details are needed for the gradient-free attribution procedure. The revised manuscript will include the precise scoring formula, a definition of the 'unsafe' signal as the attribution score derived from prompts that elicit unsafe responses, and pseudocode outlining the steps. This will help confirm that the pruning targets specific unsafe behaviors rather than general sparsity. revision: yes
-
Referee: [Experiments] No ablation against random pruning at matched sparsity is reported; without this control, observed reductions in unsafe outputs on Mistral/LLaVA could be explained by capacity reduction or incidental effects rather than targeted 'unsafe ticket' excision.
Authors: We appreciate this point. To demonstrate that the reductions are due to targeted removal of unsafe tickets rather than mere capacity reduction, we will add an ablation study with random pruning at matched sparsity levels. Results will be reported for both Mistral and LLaVA on unsafe output metrics and utility preservation. revision: yes
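The control the referee asks for is simple to state: zero the same number of randomly chosen weights and compare. A minimal sketch, with hypothetical names:

```python
# Illustrative random-pruning control at matched sparsity: same number of
# zeroed weights as the targeted method, but no attribution signal. If
# targeted pruning is doing real work, it should beat this baseline on
# safety metrics at equal sparsity.

import random

def random_prune(weights, frac, seed=0):
    """Zero a random `frac` of weights; matched sparsity, no targeting."""
    rng = random.Random(seed)
    k = max(1, int(len(weights) * frac))
    idx = rng.sample(range(len(weights)), k)
    pruned = list(weights)
    for i in idx:
        pruned[i] = 0.0
    return pruned

def sparsity(weights):
    """Fraction of exactly-zero weights."""
    return sum(1 for w in weights if w == 0.0) / len(weights)
```

Reporting both conditions at several sparsity levels would separate targeted "unsafe ticket" excision from generic capacity loss.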
-
Referee: [Results] The abstract and results sections assert 'substantial reductions in unsafe generations' and 'minimal utility loss' yet supply no concrete metrics, baselines (e.g., standard safety benchmarks, random or magnitude-based pruning), or measurement protocol for unsafe generations; this prevents evaluation of the data-to-claim link.
Authors: We agree that the results section would benefit from more concrete metrics and additional baselines. We will revise to include specific numbers on the reductions in unsafe generations, comparisons to standard baselines including random pruning and magnitude-based pruning, and a detailed measurement protocol. revision: yes
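One concrete measurement protocol the revision could adopt is attack success rate (ASR) over a fixed jailbreak prompt set. The keyword-based judge below is a deliberately trivial stand-in for whatever classifier the paper would actually use, and the refusal markers are assumptions:

```python
# ASR sketch: a jailbreak prompt counts as a successful attack if the
# model's response is not a refusal. A real protocol would replace this
# prefix check with a trained safety classifier or human annotation.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")

def is_refusal(response):
    """Crude judge: does the response open with a refusal phrase?"""
    return response.lower().startswith(REFUSAL_MARKERS)

def attack_success_rate(responses):
    """Fraction of jailbreak prompts that were NOT refused."""
    return sum(1 for r in responses if not is_refusal(r)) / len(responses)
```

Pairing ASR before and after pruning with a named utility benchmark score would give exactly the data-to-claim link the referee finds missing.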
Circularity Check
No significant circularity; empirical method with independent evaluation
full rationale
The paper presents a resource-efficient pruning framework that uses a gradient-free attribution mechanism to identify and remove parameters linked to unsafe behaviors, then evaluates the pruned models on safety metrics, jailbreak robustness, and utility preservation. No mathematical equations, predictions, or first-principles derivations are described that would reduce the observed safety improvements to quantities defined by the same attribution scores or fitted parameters used for ticket selection. The reference to the Lottery Ticket Hypothesis serves only as an interpretive lens after the empirical results and does not constitute a load-bearing self-citation or self-definitional loop. All central claims rest on post-pruning experimental testing across models, which remains independent of the selection process.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Large language models contain subnetworks ('unsafe tickets') that are primarily responsible for harmful outputs.
invented entities (2)
- unsafe tickets: no independent evidence
- safety tickets: no independent evidence