Alignment Dynamics in LLM Fine-Tuning
Pith reviewed 2026-05-20 12:18 UTC · model grok-4.3
The pith
A new dynamical framework decomposes LLM alignment changes into Rebound and Driving Forces, explaining reversal and predicting faster re-alignment on re-exposure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment.
What carries the argument
The closed-form update rule for the tractable alignment score, which decomposes the change into a Rebound Force and a Driving Force that together govern how alignment evolves under fine-tuning.
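To make the shape of such a decomposition concrete, here is a minimal toy sketch. It is not the paper's formula: it assumes a single-token softmax model whose logits are trained directly, uses the log-odds of an aligned token set as a stand-in alignment score, and labels the two first-order terms "driving" and "rebound" by analogy only. The logits, the aligned set, and the learning rate are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def log_odds(p, aligned):
    """Toy alignment score (an assumption): log p(aligned) - log p(non-aligned)."""
    pa = sum(p[i] for i in aligned)
    return math.log(pa) - math.log(1.0 - pa)

def decompose(z, q, aligned, lr):
    """First-order change of the toy score after one cross-entropy step on
    the logits, z <- z + lr * (q - p), split into two terms."""
    p = softmax(z)
    pa = sum(p[i] for i in aligned)
    n = len(p)
    # Outcome-conditioned posteriors over tokens given aligned / non-aligned.
    post_a = [p[i] / pa if i in aligned else 0.0 for i in range(n)]
    post_n = [0.0 if i in aligned else p[i] / (1.0 - pa) for i in range(n)]
    # Gradient of the log-odds score with respect to the logits.
    g = [a - b for a, b in zip(post_a, post_n)]
    # Term that depends on the training distribution q.
    driving = lr * sum(gi * qi for gi, qi in zip(g, q))
    # Term that depends only on the current model distribution p.
    rebound = -lr * sum(gi * pi for gi, pi in zip(g, p))
    return driving, rebound

z = [2.0, 0.5, -1.0, 0.0]   # hypothetical logits
aligned = {0, 1}            # hypothetical aligned tokens
q = [0.0, 0.0, 1.0, 0.0]    # fine-tuning target: one non-aligned token
lr = 1e-3

driving, rebound = decompose(z, q, aligned, lr)
p = softmax(z)
z_new = [zi + lr * (qi - pi) for zi, qi, pi in zip(z, q, p)]
actual = log_odds(softmax(z_new), aligned) - log_odds(p, aligned)
print(driving, rebound, driving + rebound, actual)
```

In this toy setting the sum of the two terms matches the actual score change up to second order in the learning rate. Whether the paper's closed-form update has the same structure would need to be checked against its own derivation.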
Load-bearing premise
The alignment score admits a tractable expression whose gradient with respect to parameters can be written in terms of outcome-conditioned posteriors over aligned and non-aligned completions.
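One way to see how outcome-conditioned posteriors can enter such a gradient is a toy single-token case. The block below is an illustrative sketch under assumptions not taken from the paper: the model is a softmax over logits $z$ with $p = \mathrm{softmax}(z)$, $A$ is a hypothetical set of aligned tokens with complement $N$, and the score is taken to be the log-odds $S = \log p(A) - \log p(N)$.

```latex
% Toy setting (assumed): p = softmax(z), p(A) = \sum_{i \in A} p_i,
% \pi_A(j) = p_j \, \mathbf{1}[j \in A] / p(A), and likewise \pi_N for the complement N.
\begin{align}
\frac{\partial \log p(A)}{\partial z_j}
  &= \frac{1}{p(A)} \sum_{i \in A} p_i \,(\delta_{ij} - p_j)
   = \pi_A(j) - p_j, \\
\frac{\partial S}{\partial z_j}
  &= \bigl(\pi_A(j) - p_j\bigr) - \bigl(\pi_N(j) - p_j\bigr)
   = \pi_A(j) - \pi_N(j).
\end{align}
```

Under these assumptions the score gradient is exactly the difference of the two outcome-conditioned posteriors. The premise above asks for an analogous identity to hold for the paper's own score and for full model parameters rather than raw logits.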
What would settle it
An experiment that measures re-alignment speed after prior alignment versus starting from scratch and finds no acceleration under re-exposure would falsify the Rehearsal Priming Effect.
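A minimal sketch of how such a comparison could be scored, assuming each run logs an alignment score per training step. The function names, the threshold, and the trajectories are hypothetical placeholders for illustration, not data or code from the paper.

```python
def steps_to_threshold(scores, threshold):
    """Return the first step at which the score reaches the threshold, or None."""
    for step, score in enumerate(scores):
        if score >= threshold:
            return step
    return None

def priming_speedup(scratch_scores, reexposure_scores, threshold):
    """Steps saved by the re-exposed run relative to the from-scratch run.

    Positive values mean the re-exposed run re-aligned faster. Returns None
    if either run never reaches the threshold.
    """
    scratch = steps_to_threshold(scratch_scores, threshold)
    reexposed = steps_to_threshold(reexposure_scores, threshold)
    if scratch is None or reexposed is None:
        return None
    return scratch - reexposed

# Made-up placeholder trajectories, for illustration only.
scratch_run = [0.0, 0.2, 0.4, 0.6, 0.8]
reexposure_run = [0.0, 0.5, 0.9]
print(priming_speedup(scratch_run, reexposure_run, threshold=0.5))
```

A speedup that stays at or below zero across seeds and settings would be the kind of outcome that counts against the predicted effect.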
Original abstract
Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a \textbf{\color{red!60!black} Rebound Force}, governed jointly by the current alignment state and the narrowness of model distribution, and a \textbf{\color{green!60!black} Driving Force}, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a \textbf{Rehearsal Priming Effect}: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. We validate these predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a tractable alignment score for LLMs and derives a closed-form update rule during fine-tuning. The update is decomposed into a Rebound Force (depending on current alignment state and distribution narrowness) and a Driving Force (determined by mismatch between training distribution and outcome-conditioned posteriors over aligned versus non-aligned completions). This decomposition is used to explain alignment reversal under subsequent fine-tuning and to predict a Rehearsal Priming Effect in which prior alignment leaves a latent posterior imprint that accelerates re-alignment upon re-exposure. The predictions are tested in safety-alignment, emergent-misalignment, and sentiment settings, with additional controlled experiments examining the dependence of rebound strength on posterior narrowness.
Significance. If the closed-form derivation holds without hidden approximations or circular re-labeling, the framework supplies a unified dynamical account that connects parameter-space gradient updates to function-space alignment behavior. The explicit prediction of the Rehearsal Priming Effect together with its experimental support, and the controlled demonstration that rebound strength scales with posterior narrowness, constitute falsifiable contributions that could guide more stable alignment procedures. The multi-setting validation adds breadth, though the strength of the conclusions remains tied to the validity of the score-gradient decomposition.
major comments (3)
- [Abstract] Abstract (paragraph on decomposition): The claim that the gradient of the alignment score admits an exact decomposition into Rebound Force (current state + narrowness) and Driving Force (training distribution vs. outcome-conditioned posteriors) is presented as following directly from the score definition. Without the explicit functional form of the alignment score and the intermediate steps of the gradient derivation (presumably in §2–3), it is impossible to verify whether this decomposition requires restrictive assumptions such as binary outcomes, short completions, or already-narrow posteriors. Because the Rehearsal Priming Effect and all subsequent predictions rest on this decomposition, the absence of the derivation steps is load-bearing.
- [Abstract] The introduction of Rebound and Driving Forces: These quantities are defined directly from the alignment score and its update; if they are simply re-expressions of terms already present in the score definition, the framework becomes tautological rather than predictive. A concrete check would be to show that the Rehearsal Priming Effect can be derived and falsified independently of the force labels themselves.
- [Experimental validation] Experimental sections (safety alignment and controlled posterior-narrowness experiments): The reported dependence of rebound strength on posterior narrowness is central to the framework, yet the manuscript does not detail how the alignment score is evaluated on held-out completions or how posterior narrowness is measured in practice. Without these operational definitions, it is unclear whether the observed effects confirm the predicted forces or reflect generic fine-tuning dynamics.
minor comments (2)
- [Abstract] The LaTeX color commands (e.g., red!60!black) appearing in the abstract should be removed for the final version.
- [Introduction] Notation for the alignment score and the two forces should be introduced with explicit symbols and units in the main text to improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important areas for improving clarity around the derivation and experimental operationalization. We address each major comment point by point below, providing additional context from the manuscript and indicating revisions where they strengthen the presentation without altering the core claims.
Point-by-point responses
-
Referee: [Abstract] Abstract (paragraph on decomposition): The claim that the gradient of the alignment score admits an exact decomposition into Rebound Force (current state + narrowness) and a Driving Force (training distribution vs. outcome-conditioned posteriors) is presented as following directly from the score definition. Without the explicit functional form of the alignment score and the intermediate steps of the gradient derivation (presumably in §2–3), it is impossible to verify whether this decomposition requires restrictive assumptions such as binary outcomes, short completions, or already-narrow posteriors. Because the Rehearsal Priming Effect and all subsequent predictions rest on this decomposition, the absence of the derivation steps is load-bearing.
Authors: The alignment score is explicitly defined in Section 2 as the difference in expected log-probability between aligned and non-aligned outcome classes under the model's posterior. Section 3 derives the closed-form gradient of this score under the standard fine-tuning objective by direct differentiation, separating the state-dependent entropy term (Rebound Force) from the data-posterior mismatch term (Driving Force). No restrictions to binary outcomes or short completions are imposed; the derivation holds for general categorical outcome spaces. We have revised the manuscript to insert a compact derivation outline immediately following the abstract and to expand the intermediate steps in a new subsection of Section 3, making the absence of hidden approximations explicit.
revision: yes
-
Referee: [Abstract] The introduction of Rebound and Driving Forces: These quantities are defined directly from the alignment score and its update; if they are simply re-expressions of terms already present in the score definition, the framework becomes tautological rather than predictive. A concrete check would be to show that the Rehearsal Priming Effect can be derived and falsified independently of the force labels themselves.
Authors: The force labels are introduced only after the mathematical decomposition is complete; the Rehearsal Priming Effect follows directly from the persistence of the outcome-conditioned posterior after the initial alignment phase, which increases the effective mismatch term on re-exposure. This prediction can be stated and tested solely in terms of the update rule and posterior evolution, without invoking the force terminology. We have added a short derivation in Section 4 that obtains the priming prediction from the posterior update equation alone, followed by the experimental test, to demonstrate that the effect does not depend on the chosen labels.
revision: partial
-
Referee: [Experimental validation] Experimental sections (safety alignment and controlled posterior-narrowness experiments): The reported dependence of rebound strength on posterior narrowness is central to the framework, yet the manuscript does not detail how the alignment score is evaluated on held-out completions or how posterior narrowness is measured in practice. Without these operational definitions, it is unclear whether the observed effects confirm the predicted forces or reflect generic fine-tuning dynamics.
Authors: The alignment score on held-out data is computed as the mean log-probability gap between 500 fixed aligned and 500 fixed non-aligned reference completions drawn from the initial model. Posterior narrowness is measured by the Shannon entropy of the model's categorical distribution over the same outcome classes, estimated via 1000 Monte-Carlo samples per checkpoint. We have inserted a dedicated paragraph in the Experimental Setup section that supplies these definitions, the exact sampling procedure, and pseudocode, together with the reported correlation between measured entropy and observed rebound magnitude.
revision: yes
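The two measurements described in this response can be sketched as follows. This is a hedged illustration, not the authors' pseudocode: it assumes per-completion log-probabilities and sampled outcome labels have already been computed elsewhere, and the function names are hypothetical. The entropy is a simple plug-in estimate from sample frequencies, which is one of several possible estimators.

```python
import math
from collections import Counter

def mean_logprob_gap(aligned_logprobs, nonaligned_logprobs):
    """Mean log-probability of aligned references minus that of non-aligned ones.

    Inputs are assumed to be precomputed log-probabilities of fixed reference
    completions under the checkpoint being evaluated.
    """
    mean_aligned = sum(aligned_logprobs) / len(aligned_logprobs)
    mean_nonaligned = sum(nonaligned_logprobs) / len(nonaligned_logprobs)
    return mean_aligned - mean_nonaligned

def shannon_entropy(outcome_samples):
    """Plug-in Shannon entropy (in nats) of sampled outcome-class labels.

    Lower entropy corresponds to a narrower distribution over outcome classes.
    """
    counts = Counter(outcome_samples)
    n = len(outcome_samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical usage with placeholder inputs:
gap = mean_logprob_gap([-1.0, -3.0], [-4.0, -6.0])
entropy = shannon_entropy(["aligned"] * 500 + ["non-aligned"] * 500)
print(gap, entropy)
```

With 1000 samples per checkpoint, as stated in the response, the plug-in estimate is slightly biased downward for distributions over many classes. That bias would matter mostly when comparing checkpoints whose entropies are close.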
Circularity Check
Rehearsal Priming Effect reduces to re-labeling of alignment score gradient decomposition by construction
specific steps
-
self-definitional
[Abstract (decomposition paragraph)]
"Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment."
The alignment score is defined to admit a tractable expression whose gradient decomposes exactly via outcome-conditioned posteriors. The Rebound and Driving Forces are then the named parts of that derived update. The Rehearsal Priming Effect is therefore a direct renaming of the latent posterior structure already present in the Driving Force term, making the 'prediction' tautological with the score definition rather than a separate result.
full rationale
The paper introduces a tractable alignment score and derives a closed-form update whose gradient is written in terms of outcome-conditioned posteriors. It then names the resulting terms Rebound Force and Driving Force and presents the Rehearsal Priming Effect as a prediction that follows from prior alignment leaving a latent posterior imprint. Because the decomposition and the imprint effect are direct consequences of the score definition and the assumed gradient form, the claimed dynamical prediction is equivalent to the input assumptions rather than an independent derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The alignment score admits a closed-form gradient with respect to model parameters that can be expressed using outcome-conditioned posteriors.
invented entities (2)
- Rebound Force: no independent evidence
- Driving Force: no independent evidence