pith. machine review for the scientific record.

arxiv: 2604.15780 · v1 · submitted 2026-04-17 · 💻 cs.LG · cs.CL

Recognition: unknown

Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 09:07 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords LLM safety · pruning · unsafe tickets · jailbreak robustness · lottery ticket hypothesis · post-hoc alignment · gradient-free attribution

The pith

Pruning parameters tied to unsafe behaviors in language models reduces harmful generations and boosts resistance to jailbreak attacks with little loss in utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that aligned models still carry subnetworks from pre-training that produce unsafe outputs. It presents a pruning method that uses gradient-free attribution to locate and delete these subnetworks, which the authors call unsafe tickets, while leaving safety tickets that sustain performance. The technique needs only modest computing resources and applies to various model sizes and quantized forms. If the claim holds, it supplies a lightweight way to improve safety after initial training instead of relying solely on methods like supervised fine-tuning or reinforcement learning from human feedback. Readers would care because this points to a cheaper route for making deployed models less likely to generate harm.

Core claim

Even after alignment, language models retain unsafe tickets—specific parameter subsets that trigger harmful outputs. A gradient-free attribution procedure identifies these tickets and prunes them, revealing safety tickets that preserve general capabilities. The result is fewer unsafe generations, greater resistance to jailbreak prompts, and only minimal drops in utility across tested models and architectures.

What carries the argument

Gradient-free attribution to isolate unsafe tickets, then prune them to expose safety tickets, all framed inside the Lottery Ticket Hypothesis.
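
The abstract gives no scoring formula, so the following is only an illustrative reconstruction of what a gradient-free attribution-and-prune pass could look like: a Wanda-style importance score (weight magnitude times input-activation norm) computed separately on unsafe and benign calibration prompts, with weights zeroed where the unsafe-set score dominates the benign-set score. The function names, the margin, and the sparsity budget are assumptions made for this sketch, not details taken from the paper.

import torch
import torch.nn as nn


@torch.no_grad()
def activation_norms(model, layers, batches):
    """Accumulate per-input-feature activation norms entering each Linear layer."""
    norms = {name: torch.zeros(m.in_features) for name, m in layers.items()}
    handles = []
    for name, module in layers.items():
        def hook(mod, inputs, output, name=name):
            x = inputs[0].detach().float().reshape(-1, mod.in_features)
            norms[name] += x.pow(2).sum(dim=0).cpu()
        handles.append(module.register_forward_hook(hook))
    for batch in batches:  # each batch: tokenized prompts already on the model's device
        model(**batch)
    for h in handles:
        h.remove()
    return {name: v.sqrt() for name, v in norms.items()}


@torch.no_grad()
def prune_unsafe_tickets(model, unsafe_batches, benign_batches, margin=2.0, sparsity=0.01):
    """Zero weights whose unsafe-prompt importance dominates their benign-prompt importance."""
    layers = {n: m for n, m in model.named_modules() if isinstance(m, nn.Linear)}
    unsafe_norm = activation_norms(model, layers, unsafe_batches)
    benign_norm = activation_norms(model, layers, benign_batches)
    for name, module in layers.items():
        w = module.weight.data.abs().cpu()
        score_unsafe = w * unsafe_norm[name].unsqueeze(0)   # Wanda-style score on unsafe prompts
        score_benign = w * benign_norm[name].unsqueeze(0)   # same score on benign prompts
        ratio = score_unsafe / (score_benign + 1e-8)
        k = max(1, int(sparsity * ratio.numel()))
        threshold = ratio.flatten().topk(k).values.min()    # keep only the top-k most unsafe-specific weights
        mask = (ratio >= threshold) & (ratio >= margin)
        module.weight.data[mask.to(module.weight.device)] = 0.0

The ratio test is what makes the sketch target unsafe tickets rather than generic sparsity: a weight is removed only if it matters far more on unsafe prompts than on benign ones.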

If this is right

  • Unsafe generations drop substantially on the evaluated models.
  • Resistance to jailbreak attacks increases.
  • Utility on general tasks remains nearly unchanged.
  • The same procedure works on different model families and on quantized versions.
  • Safety gains occur through a single post-training pass that uses limited hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the unsafe tickets turn out to be stable across training runs, future models could be built with explicit safety-ticket masks from the start.
  • The same attribution-plus-prune loop might locate and remove other localized problems such as factual errors or biased patterns.
  • Deployment pipelines could run this pruning step on-device for edge models where retraining is impossible.

Load-bearing premise

The attribution step correctly flags only the parameters that cause unsafe outputs and leaves the parameters needed for normal task performance untouched.

What would settle it

Apply the pruning to a model, then test whether the rate of unsafe outputs stays the same or rises, or whether accuracy on standard benchmarks falls sharply.
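
A minimal harness for that test might look like the sketch below. Here generate, is_unsafe_judge, and utility_score are hypothetical stand-ins for a decoding loop, a harm classifier, and a benchmark scorer; the paper's measurement protocol is not specified in the abstract, and the utility tolerance is illustrative.

def unsafe_rate(model, tokenizer, unsafe_prompts, generate, is_unsafe_judge):
    """Fraction of harmful prompts that still elicit an unsafe completion."""
    hits = 0
    for prompt in unsafe_prompts:
        response = generate(model, tokenizer, prompt)
        hits += int(is_unsafe_judge(prompt, response))
    return hits / max(1, len(unsafe_prompts))


def settle(base_model, pruned_model, tokenizer, unsafe_prompts, benign_tasks,
           generate, is_unsafe_judge, utility_score, utility_tolerance=0.02):
    """Compare unsafe-output rate and a utility proxy before and after pruning."""
    rate_before = unsafe_rate(base_model, tokenizer, unsafe_prompts, generate, is_unsafe_judge)
    rate_after = unsafe_rate(pruned_model, tokenizer, unsafe_prompts, generate, is_unsafe_judge)
    util_before = utility_score(base_model, tokenizer, benign_tasks)
    util_after = utility_score(pruned_model, tokenizer, benign_tasks)
    return {
        "unsafe_rate_before": rate_before,
        "unsafe_rate_after": rate_after,
        "utility_before": util_before,
        "utility_after": util_after,
        # The claim fails if the unsafe rate does not drop, or if utility drops sharply.
        "claim_supported": rate_after < rate_before
                           and (util_before - util_after) <= utility_tolerance,
    }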

Figures

Figures reproduced from arXiv: 2604.15780 by Michael Backes, Mingjie Li, Wai Man Si, Yang Zhang.

Figure 1. Illustration of post-training methods for safety. The instruct model fails to produce safe outputs when given unsafe prompts.
Figure 2. Overview of the pruning framework pipeline.
Figure 3. Impact of framework hyperparameters on unsafe output rate.
Figure 4. Token loss dynamics before and after pruning.
read the original abstract

Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain "unsafe tickets" responsible for harmful behaviors, and pruning reveals "safety tickets" that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resource-constrained settings.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a resource-efficient post-hoc pruning framework that uses a gradient-free attribution mechanism to identify and remove 'unsafe tickets' (subnetworks responsible for harmful outputs) from aligned LLMs such as Mistral and LLaVA. It claims this yields substantial reductions in unsafe generations, improved robustness to jailbreak attacks, and minimal utility loss while revealing 'safety tickets' consistent with an extension of the Lottery Ticket Hypothesis; the method is presented as generalizing across architectures and quantized variants with only modest compute.

Significance. If the central empirical claims hold with proper controls, the work would provide a lightweight, training-free alternative to RLHF/SFT for safety alignment that is attractive for resource-constrained deployment. The framing around unsafe/safety tickets and the reported generalization are potentially novel contributions, but the absence of quantitative metrics, baselines, and specificity tests in the current presentation limits the assessed impact.

major comments (3)
  1. [Method] The gradient-free attribution procedure is described at a high level but without an explicit scoring formula, definition of the 'unsafe' signal, or pseudocode; this is load-bearing because all downstream claims (safety gains, jailbreak robustness, minimal utility loss) rest on the attribution isolating behaviorally specific parameters rather than producing generic sparsity.
  2. [Experiments] No ablation against random pruning at matched sparsity is reported; without this control, observed reductions in unsafe outputs on Mistral/LLaVA could be explained by capacity reduction or incidental effects rather than targeted 'unsafe ticket' excision (a minimal sketch of such a matched-sparsity control follows the minor comments below).
  3. [Results] The abstract and results sections assert 'substantial reductions in unsafe generations' and 'minimal utility loss' yet supply no concrete metrics, baselines (e.g., standard safety benchmarks, random or magnitude-based pruning), or measurement protocol for unsafe generations; this prevents evaluation of the data-to-claim link.
minor comments (2)
  1. [Introduction] Notation for 'unsafe tickets' and 'safety tickets' is introduced without a formal definition or reference to prior Lottery Ticket Hypothesis extensions; a brief clarifying paragraph would improve readability.
  2. [Figures/Tables] Figure captions and table headers should explicitly state the exact safety and utility metrics used (e.g., percentage of unsafe responses, specific benchmark names) to allow direct comparison with related work.
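
To make major comment 2 concrete, a matched-sparsity control could look like the sketch below: read off the per-layer sparsity produced by the targeted mask (for example, from the illustrative prune_unsafe_tickets pass sketched earlier), then zero a uniformly random weight subset of the same size in a copy of the model. Nothing here is taken from the paper; it is one way the requested ablation could be run.

import copy
import torch
import torch.nn as nn


@torch.no_grad()
def measured_sparsity(model):
    """Per-layer fraction of exactly-zero weights in every Linear module."""
    return {name: (module.weight == 0).float().mean().item()
            for name, module in model.named_modules() if isinstance(module, nn.Linear)}


@torch.no_grad()
def random_prune_matched(model, sparsity_per_layer, seed=0):
    """Zero a uniformly random weight subset per Linear layer at a given target sparsity."""
    gen = torch.Generator().manual_seed(seed)
    control = copy.deepcopy(model)
    for name, module in control.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        frac = sparsity_per_layer.get(name, 0.0)
        if frac <= 0.0:
            continue
        k = int(round(frac * module.weight.numel()))
        idx = torch.randperm(module.weight.numel(), generator=gen)[:k]
        flat = module.weight.data.view(-1)
        flat[idx.to(flat.device)] = 0.0
    return control

Running the unsafe-rate harness sketched earlier on both the targeted model and this random control at identical sparsity gives the comparison the comment asks for; if the control reduces unsafe outputs just as much, capacity reduction rather than ticket excision explains the gain.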

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful review and valuable suggestions. We will revise the manuscript to address all the major comments by adding the requested details, controls, and metrics. Below we provide point-by-point responses.

read point-by-point responses
  1. Referee: [Method] The gradient-free attribution procedure is described at a high level but without an explicit scoring formula, definition of the 'unsafe' signal, or pseudocode; this is load-bearing because all downstream claims (safety gains, jailbreak robustness, minimal utility loss) rest on the attribution isolating behaviorally specific parameters rather than producing generic sparsity.

    Authors: We agree with the referee that more explicit details are needed for the gradient-free attribution procedure. The revised manuscript will include the precise scoring formula, a definition of the 'unsafe' signal as the attribution score derived from prompts that elicit unsafe responses, and pseudocode outlining the steps. This will help confirm that the pruning targets specific unsafe behaviors rather than general sparsity. revision: yes

  2. Referee: [Experiments] No ablation against random pruning at matched sparsity is reported; without this control, observed reductions in unsafe outputs on Mistral/LLaVA could be explained by capacity reduction or incidental effects rather than targeted 'unsafe ticket' excision.

    Authors: We appreciate this point. To demonstrate that the reductions are due to targeted removal of unsafe tickets rather than mere capacity reduction, we will add an ablation study with random pruning at matched sparsity levels. Results will be reported for both Mistral and LLaVA on unsafe output metrics and utility preservation. revision: yes

  3. Referee: [Results] The abstract and results sections assert 'substantial reductions in unsafe generations' and 'minimal utility loss' yet supply no concrete metrics, baselines (e.g., standard safety benchmarks, random or magnitude-based pruning), or measurement protocol for unsafe generations; this prevents evaluation of the data-to-claim link.

    Authors: We agree that the results section would benefit from more concrete metrics and additional baselines. We will revise to include specific numbers on the reductions in unsafe generations, comparisons to standard baselines including random pruning and magnitude-based pruning, and a detailed measurement protocol. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent evaluation

full rationale

The paper presents a resource-efficient pruning framework that uses a gradient-free attribution mechanism to identify and remove parameters linked to unsafe behaviors, then evaluates the pruned models on safety metrics, jailbreak robustness, and utility preservation. No mathematical equations, predictions, or first-principles derivations are described that would reduce the observed safety improvements to quantities defined by the same attribution scores or fitted parameters used for ticket selection. The reference to the Lottery Ticket Hypothesis serves only as an interpretive lens after the empirical results and does not constitute a load-bearing self-citation or self-definitional loop. All central claims rest on post-pruning experimental testing across models, which remains independent of the selection process.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the domain assumption that unsafe behaviors are localized in identifiable subnetworks and that a gradient-free score can surface them; the abstract details no free parameters, and the only invented entities are the two halves of the ticket metaphor.

axioms (1)
  • domain assumption · Large language models contain subnetworks ('unsafe tickets') that are primarily responsible for harmful outputs.
    Core premise extending the Lottery Ticket Hypothesis to safety; invoked to justify pruning as a safety intervention.
invented entities (2)
  • unsafe tickets · no independent evidence
    purpose: Subnetworks that trigger unsafe generations
    Conceptual entity introduced to explain why pruning improves safety.
  • safety tickets · no independent evidence
    purpose: Subnetworks that preserve utility after unsafe parameters are removed
    Complementary concept used to argue that pruning does not destroy capability.

pith-pipeline@v0.9.0 · 5475 in / 1273 out tokens · 47612 ms · 2026-05-10T09:07:59.209211+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

64 extracted references · 21 canonical work pages · 14 internal anchors
