Alignment Dynamics in LLM Fine-Tuning

Huanran Chen; Yinpeng Dong; Yuhan Huang

arxiv: 2605.18309 · v1 · pith:BSYNCCRQnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 12:18 UTC · model grok-4.3

keywords: LLM alignment · fine-tuning dynamics · alignment reversal · Rehearsal Priming Effect · Rebound Force · Driving Force · posterior narrowness

The pith

A new dynamical framework decomposes LLM alignment changes into Rebound and Driving Forces, explaining reversal and predicting faster re-alignment on re-exposure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified framework for alignment dynamics in LLMs by introducing a tractable alignment score and deriving its closed-form update under fine-tuning. This decomposition splits the update into a Rebound Force that depends on the current alignment state and the narrowness of the model distribution, competing against a Driving Force set by how the training data matches outcome-conditioned posteriors over aligned and non-aligned completions. A sympathetic reader would care because the framework explains why alignment is fragile under continued training and why narrower distributions strengthen reversal, while also predicting that prior alignment leaves a latent imprint that amplifies the Driving Force and accelerates re-alignment upon re-exposure to aligned data. The results are validated in safety, misalignment, and sentiment experiments.

Core claim

We introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment.

What carries the argument

The closed-form update rule for the tractable alignment score, which decomposes the change into a Rebound Force and a Driving Force that together govern how alignment evolves under fine-tuning.
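
As a hedged illustration of how an update can split into a state-dependent term and a data-dependent term, the sketch below uses a toy two-outcome softmax. It is not the paper's model or its closed-form rule. The function names, the learning rate, and the labels `data_term` and `state_term` are assumptions introduced here; they only loosely mirror the roles the text assigns to the Driving Force and the Rebound Force.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def toy_score_step(score: float, data_aligned_frac: float, lr: float = 0.1):
    """One gradient-descent step on cross-entropy for a two-logit softmax.

    `score` is the logit gap log p(aligned) - log p(non-aligned).
    For this toy, the step is 2 * lr * (data_aligned_frac - sigmoid(score)),
    which splits into a data-only term and a state-only term.
    Both labels are illustrative assumptions, not the paper's definitions.
    """
    data_term = 2.0 * lr * data_aligned_frac   # depends only on the training data
    state_term = -2.0 * lr * sigmoid(score)    # depends only on the current state
    return score + data_term + state_term, data_term, state_term


def run(score: float, data_aligned_frac: float, steps: int, lr: float = 0.1):
    """Return the score trajectory over `steps` updates."""
    trajectory = [score]
    for _ in range(steps):
        score, _, _ = toy_score_step(score, data_aligned_frac, lr)
        trajectory.append(score)
    return trajectory


# Stage 1: train on fully aligned data; Stage 2: reverse with non-aligned data.
stage1 = run(0.0, 1.0, steps=50)
stage2 = run(stage1[-1], 0.0, steps=50)
```

In this toy, the state term pulls harder the more aligned the model already is, so reverse fine-tuning begins with a larger drop from a highly aligned state. The toy has no notion of distribution narrowness or of a posterior imprint, so it says nothing about the Rehearsal Priming Effect.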

Load-bearing premise

The alignment score admits a tractable expression whose gradient with respect to parameters can be written in terms of outcome-conditioned posteriors over aligned and non-aligned completions.

What would settle it

An experiment that measures re-alignment speed after prior alignment versus starting from scratch and finds no acceleration under re-exposure would falsify the Rehearsal Priming Effect.
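
One way such a comparison could be scored is sketched below, assuming alignment-score curves have already been logged per training step for a previously aligned run and a from-scratch run. The helper names, the threshold, and the two curves are hypothetical placeholders, not values from the paper.

```python
from typing import Optional, Sequence


def steps_to_threshold(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first step whose score reaches `threshold`, or None."""
    for step, score in enumerate(scores):
        if score >= threshold:
            return step
    return None


def priming_speedup(primed: Sequence[float], scratch: Sequence[float],
                    threshold: float) -> Optional[int]:
    """Steps saved by the primed run; positive means the primed run was faster.

    Returns None if either run never reaches the threshold.
    """
    primed_steps = steps_to_threshold(primed, threshold)
    scratch_steps = steps_to_threshold(scratch, threshold)
    if primed_steps is None or scratch_steps is None:
        return None
    return scratch_steps - primed_steps


# Placeholder curves for illustration only (not experimental data).
primed_curve = [0.0, 0.5, 0.9]
scratch_curve = [0.0, 0.2, 0.4, 0.7, 0.9]
speedup = priming_speedup(primed_curve, scratch_curve, threshold=0.8)
```

A speedup that stays at or below zero across seeds and settings would be the kind of outcome the paragraph above describes as falsifying; a single threshold is a simplification, and comparing slopes or areas under the curves would be reasonable alternatives.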

Figures

Figures reproduced from arXiv: 2605.18309 by Huanran Chen, Yinpeng Dong, Yuhan Huang.

**Figure 2.** Analysis of rebound dynamics. (a) Results on the BeaverTails dataset. (b) Results on the Emergent Misalignment dataset. Black lines represent the initial forward fine-tuning (Stage 1), while colored lines denote the subsequent reverse fine-tuning (Stage 2). The distinct colors indicate varying fine-tuning steps of Stage 1 training prior to the onset of Stage 2. We observe a rapid rebound to the baseline pe…

**Figure 3.** Impact of distribution narrowness on rebound force. Results are shown for (a) Llama3.1-8B and (b) Gemma-2-2b. “Stage 2 Reverse (Alpaca)” denotes the application of the alpaca-cleaned dataset for stage 2 reverse fine-tuning. The narrowness of dataset distribution increases across four levels: Fixed Template (highest), Diversified Template, Standard, and Style-Diversified (lowest). Lighter colors indicate h…

**Figure 4.** Dynamics of alignment score S during Stage 3 (Re-exposure). Subplots (a) and (b) present results on the IMDb dataset, initialized with Stage 1 fine-tuning on positive and negative sentiments, respectively. Subplots (c) and (d) show results for BeaverTails and Emergent Misalignment. The color gradient represents the duration of Stage 1 fine-tuning, where lighter lines denote a higher number of training ste…

**Figure 5.** Additional Results on the Rebound Effect. (a) Performance on the IMDb dataset with forward fine-tuning (Stage 1) using positive reviews. (b) Performance on the IMDb dataset with forward fine-tuning (Stage 1) using negative reviews. (c) Results on the BeaverTails dataset. A rapid rebound to the baseline is consistently observed across all evaluated datasets and models.

**Figure 6.** Additional Results on the Rehearsal Priming Effect. The three panels above present results obtained using the Gemma-2-2B model. Consistent with the theoretical predictions detailed in the main text, lighter lines (representing more SFT steps during Stage 1 fine-tuning) exhibit a faster rate of increase compared to darker lines, confirming that initial exposure duration accelerates subsequent adaptation. …

Original abstract

Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a \textbf{\color{red!60!black} Rebound Force}, governed jointly by the current alignment state and the narrowness of model distribution, and a \textbf{\color{green!60!black} Driving Force}, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a \textbf{Rehearsal Priming Effect}: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. We validate these predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a tractable alignment score for LLMs and derives a closed-form update rule during fine-tuning. The update is decomposed into a Rebound Force (depending on current alignment state and distribution narrowness) and a Driving Force (determined by mismatch between training distribution and outcome-conditioned posteriors over aligned versus non-aligned completions). This decomposition is used to explain alignment reversal under subsequent fine-tuning and to predict a Rehearsal Priming Effect in which prior alignment leaves a latent posterior imprint that accelerates re-alignment upon re-exposure. The predictions are tested in safety-alignment, emergent-misalignment, and sentiment settings, with additional controlled experiments examining the dependence of rebound strength on posterior narrowness.

Significance. If the closed-form derivation holds without hidden approximations or circular re-labeling, the framework supplies a unified dynamical account that connects parameter-space gradient updates to function-space alignment behavior. The explicit prediction of the Rehearsal Priming Effect together with its experimental support, and the controlled demonstration that rebound strength scales with posterior narrowness, constitute falsifiable contributions that could guide more stable alignment procedures. The multi-setting validation adds breadth, though the strength of the conclusions remains tied to the validity of the score-gradient decomposition.

major comments (3)

[Abstract] Abstract (paragraph on decomposition): The claim that the gradient of the alignment score admits an exact decomposition into Rebound Force (current state + narrowness) and Driving Force (training distribution vs. outcome-conditioned posteriors) is presented as following directly from the score definition. Without the explicit functional form of the alignment score and the intermediate steps of the gradient derivation (presumably in §2–3), it is impossible to verify whether this decomposition requires restrictive assumptions such as binary outcomes, short completions, or already-narrow posteriors. Because the Rehearsal Priming Effect and all subsequent predictions rest on this decomposition, the absence of the derivation steps is load-bearing.
[Abstract] The introduction of Rebound and Driving Forces: These quantities are defined directly from the alignment score and its update; if they are simply re-expressions of terms already present in the score definition, the framework becomes tautological rather than predictive. A concrete check would be to show that the Rehearsal Priming Effect can be derived and falsified independently of the force labels themselves.
[Experimental validation] Experimental sections (safety alignment and controlled posterior-narrowness experiments): The reported dependence of rebound strength on posterior narrowness is central to the framework, yet the manuscript does not detail how the alignment score is evaluated on held-out completions or how posterior narrowness is measured in practice. Without these operational definitions, it is unclear whether the observed effects confirm the predicted forces or reflect generic fine-tuning dynamics.

minor comments (2)

[Abstract] The LaTeX color commands (e.g., red!60!black) appearing in the abstract should be removed for the final version.
[Introduction] Notation for the alignment score and the two forces should be introduced with explicit symbols and units in the main text to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important areas for improving clarity around the derivation and experimental operationalization. We address each major comment point by point below, providing additional context from the manuscript and indicating revisions where they strengthen the presentation without altering the core claims.

Referee: [Abstract] Abstract (paragraph on decomposition): The claim that the gradient of the alignment score admits an exact decomposition into Rebound Force (current state + narrowness) and a Driving Force (training distribution vs. outcome-conditioned posteriors) is presented as following directly from the score definition. Without the explicit functional form of the alignment score and the intermediate steps of the gradient derivation (presumably in §2–3), it is impossible to verify whether this decomposition requires restrictive assumptions such as binary outcomes, short completions, or already-narrow posteriors. Because the Rehearsal Priming Effect and all subsequent predictions rest on this decomposition, the absence of the derivation steps is load-bearing.

Authors: The alignment score is explicitly defined in Section 2 as the difference in expected log-probability between aligned and non-aligned outcome classes under the model's posterior. Section 3 derives the closed-form gradient of this score under the standard fine-tuning objective by direct differentiation, separating the state-dependent entropy term (Rebound Force) from the data-posterior mismatch term (Driving Force). No restrictions to binary outcomes or short completions are imposed; the derivation holds for general categorical outcome spaces. We have revised the manuscript to insert a compact derivation outline immediately following the abstract and to expand the intermediate steps in a new subsection of Section 3, making the absence of hidden approximations explicit. (Revision: yes.)
Referee: [Abstract] The introduction of Rebound and Driving Forces: These quantities are defined directly from the alignment score and its update; if they are simply re-expressions of terms already present in the score definition, the framework becomes tautological rather than predictive. A concrete check would be to show that the Rehearsal Priming Effect can be derived and falsified independently of the force labels themselves.

Authors: The force labels are introduced only after the mathematical decomposition is complete; the Rehearsal Priming Effect follows directly from the persistence of the outcome-conditioned posterior after the initial alignment phase, which increases the effective mismatch term on re-exposure. This prediction can be stated and tested solely in terms of the update rule and posterior evolution, without invoking the force terminology. We have added a short derivation in Section 4 that obtains the priming prediction from the posterior update equation alone, followed by the experimental test, to demonstrate that the effect is not dependent on the chosen labels. (Revision: partial.)
Referee: [Experimental validation] Experimental sections (safety alignment and controlled posterior-narrowness experiments): The reported dependence of rebound strength on posterior narrowness is central to the framework, yet the manuscript does not detail how the alignment score is evaluated on held-out completions or how posterior narrowness is measured in practice. Without these operational definitions, it is unclear whether the observed effects confirm the predicted forces or reflect generic fine-tuning dynamics.

Authors: The alignment score on held-out data is computed as the mean log-probability gap between 500 fixed aligned and 500 fixed non-aligned reference completions drawn from the initial model. Posterior narrowness is measured by the Shannon entropy of the model's categorical distribution over the same outcome classes, estimated via 1000 Monte-Carlo samples per checkpoint. We have inserted a dedicated paragraph in the Experimental Setup section that supplies these definitions, the exact sampling procedure, and pseudocode, together with the reported correlation between measured entropy and observed rebound magnitude. (Revision: yes.)
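
The two measurements described in this response can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the inputs are assumed to be per-completion log-probabilities and sampled outcome labels already obtained from a checkpoint, and the entropy is a simple plug-in estimate. The response does not specify how log-probabilities are aggregated over tokens or how samples are labeled, so neither step is shown.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mean_logprob_gap(aligned_logps: Sequence[float],
                     non_aligned_logps: Sequence[float]) -> float:
    """Mean log-probability of aligned references minus that of non-aligned ones."""
    aligned_mean = sum(aligned_logps) / len(aligned_logps)
    non_aligned_mean = sum(non_aligned_logps) / len(non_aligned_logps)
    return aligned_mean - non_aligned_mean


def empirical_entropy(outcome_samples: Sequence[Hashable]) -> float:
    """Plug-in Shannon entropy (in nats) of sampled outcome classes.

    Lower entropy corresponds to a narrower distribution over outcomes.
    """
    total = len(outcome_samples)
    counts = Counter(outcome_samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Tiny placeholder inputs for illustration only (not experimental data).
gap = mean_logprob_gap([-1.0, -2.0], [-3.0, -5.0])
entropy = empirical_entropy(["aligned", "aligned", "non_aligned", "non_aligned"])
```

With the sample sizes stated above, `aligned_logps` and `non_aligned_logps` would each hold 500 values and `outcome_samples` would hold 1000 labels per checkpoint. A plug-in estimate is biased downward for small samples, which is one reason the exact sampling procedure matters.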

Circularity Check

1 step flagged

Rehearsal Priming Effect reduces to re-labeling of alignment score gradient decomposition by construction

specific steps

self-definitional [Abstract (decomposition paragraph)]
"Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment."

The alignment score is defined to admit a tractable expression whose gradient decomposes exactly via outcome-conditioned posteriors. The Rebound and Driving Forces are then the named parts of that derived update. The Rehearsal Priming Effect is therefore a direct renaming of the latent posterior structure already present in the Driving Force term, making the 'prediction' tautological with the score definition rather than a separate result.

full rationale

The paper introduces a tractable alignment score and derives a closed-form update whose gradient is written in terms of outcome-conditioned posteriors. It then names the resulting terms Rebound Force and Driving Force and presents the Rehearsal Priming Effect as a prediction that follows from prior alignment leaving a latent posterior imprint. Because the decomposition and the imprint effect are direct consequences of the score definition and the assumed gradient form, the claimed dynamical prediction is equivalent to the input assumptions rather than an independent derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the existence of a tractable alignment score whose update admits an exact decomposition; no explicit free parameters are named, but the narrowness of the posterior and the alignment of training data with outcome-conditioned posteriors function as modeling choices that must be instantiated for any concrete model.

axioms (1)

domain assumption: The alignment score admits a closed-form gradient with respect to model parameters that can be expressed using outcome-conditioned posteriors.
Invoked when the abstract states that the update is derived and decomposed into rebound and driving components.

invented entities (2)

Rebound Force (no independent evidence)
purpose: Component of the alignment update that opposes change and depends on current alignment state and posterior narrowness.
Introduced as one of the two competing terms in the derived update rule.
Driving Force (no independent evidence)
purpose: Component of the alignment update determined by alignment between training distribution and outcome-conditioned posteriors.
Introduced as the second competing term in the derived update rule.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 7 internal anchors

[1]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[2]

Instruction tuning for large language models: A survey.ACM Computing Surveys, 58(7):1–36, 2026

Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Guoyin Wang, et al. Instruction tuning for large language models: A survey.ACM Computing Surveys, 58(7):1–36, 2026

work page 2026
[3]

Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

work page 2017
[4]

Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008–3021, 2020

work page 2020
[5]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

work page 2023
[6]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. Rlaif vs. rlhf: Scaling rein- forcement learning from human feedback with ai feedback.arXiv preprint arXiv:2309.00267, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Fine-tuning aligned language models compromises safety, even when users do not intend to! In International Conference on Learning Representations, 2024

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In International Conference on Learning Representations, 2024

work page 2024
[9]

Attack via overfitting: 10-shot benign fine-tuning to jailbreak llms

Zhixin Xie, Xurui Song, and Jun Luo. Attack via overfitting: 10-shot benign fine-tuning to jailbreak llms. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[10]

Emergent misalignment: Narrow finetuning can produce broadly misaligned llms

Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martıén Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms. InInternational Conference on Machine Learning, pages 4043–4068. PMLR, 2025

work page 2025
[11]

Understanding and preserving safety in fine-tuned llms.arXiv preprint arXiv:2601.10141, 2026

Jiawen Zhang, Yangfan Hu, Kejia Chen, Lipeng He, Jiachen Ma, Jian Lou, Dan Li, Jian Liu, Xiaohu Yang, and Ruoxi Jia. Understanding and preserving safety in fine-tuned llms.arXiv preprint arXiv:2601.10141, 2026

work page arXiv 2026
[12]

Assessing the brittleness of safety alignment via pruning and low-rank modifications

Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, and Peter Henderson. Assessing the brittleness of safety alignment via pruning and low-rank modifications. InProceedings of the 41st International Conference on Machine Learning, pages 52588–52610, 2024

work page 2024
[13]

What is in your safe data? identifying benign data that breaks safety

Luxi He, Mengzhou Xia, and Peter Henderson. What is in your safe data? identifying benign data that breaks safety. InFirst Conference on Language Modeling, 2024

work page 2024
[14]

Language models resist alignment: Evidence from data compression

Jiaming Ji, Kaile Wang, Tianyi Alex Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Josef Dai, Yunhuai Liu, and Yaodong Yang. Language models resist alignment: Evidence from data compression. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23411–23432, 2025. 10

work page 2025
[15]

Understanding catastrophic forgetting in language models via implicit inference

Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. Understanding catastrophic forgetting in language models via implicit inference. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[16]

Fundamental safety-capability trade-offs in fine-tuning large language models.PNAS nexus, 5(4):pgag097, 2026

Pin-Yu Chen, Han Shen, Payel Das, and Tianyi Chen. Fundamental safety-capability trade-offs in fine-tuning large language models.PNAS nexus, 5(4):pgag097, 2026

work page 2026
[17]

Dissecting learning and forgetting in language model finetuning

Xiao Zhang and Ji Wu. Dissecting learning and forgetting in language model finetuning. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[18]

Learning dynamics of llm finetuning

Yi Ren and Danica J Sutherland. Learning dynamics of llm finetuning. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[19]

Safety alignment should be made more than just a few tokens deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. In13th International Conference on Learning Representations, ICLR 2025, pages 11401–11431. International Conference on Learning Representations, ICLR, 2025

work page 2025
[20]

Trak: attributing model behavior at scale

Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander M ˛ adry. Trak: attributing model behavior at scale. InProceedings of the 40th International Conference on Machine Learning, pages 27074–27113, 2023

work page 2023
[21]

Lpntk: Better generalisation with less data via sample interaction during learning

Shangmin Guo, Yi Ren, Stefano V Albrecht, and Kenny Smith. Lpntk: Better generalisation with less data via sample interaction during learning. 2024

work page 2024
[22]

Persona features control emergent misalignment.arXiv preprint arXiv:2506.19823, 2025

Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A Chi, Samuel Mis- erendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, et al. Persona features control emergent misalignment.arXiv preprint arXiv:2506.19823, 2025

work page arXiv 2025
[23]

ShieldGemma: Generative AI Content Moderation Based on Gemma

Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, Olivia Sturman, and Oscar Wahltinez. Shieldgemma: Generative ai content moderation based on gemma, 2024. URLhttps://arxiv.org/abs/2407.21772

work page internal anchor Pith review arXiv 2024
[24]

More than a feeling: Accuracy and application of sentiment analysis.International Journal of Research in Marketing, 40(1):75–87, 2023

Jochen Hartmann, Mark Heitmann, Christian Siebert, and Christina Schamp. More than a feeling: Accuracy and application of sentiment analysis.International Journal of Research in Marketing, 40(1):75–87, 2023

work page 2023
[25]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Gemma Team. Gemma. 2024. doi: 10.34740/KAGGLE/M/3301. URL https://www.kaggle. com/m/3301

work page doi:10.34740/kaggle/m/3301 2024
[27]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report, 2025. URLhttps://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

work page 2023
[29]

Beavertails: Towards improved safety alignment of llm via a human-preference dataset.arXiv preprint arXiv:2307.04657, 2023

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset.arXiv preprint arXiv:2307.04657, 2023

work page arXiv 2023
[30]

Safety at one shot: Patching fine-tuned llms with a single instance.arXiv preprint arXiv:2601.01887, 2026

Jiawen Zhang, Lipeng He, Kejia Chen, Jian Lou, Jian Liu, Xiaohu Yang, and Ruoxi Jia. Safety at one shot: Patching fine-tuned llms with a single instance.arXiv preprint arXiv:2601.01887, 2026

work page arXiv 2026
[31]

Maas, Raymond E

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y . Ng, and Christopher Potts. Learning word vectors for sentiment analysis. InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142– 150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. U...

work page 2011
[32]

Task arithmetic in the tangent space: Improved editing of pre-trained models.Advances in Neural Information Processing Systems, 36:66727–66754, 2023

Guillermo Ortiz-Jimenez, Alessandro Favero, and Pascal Frossard. Task arithmetic in the tangent space: Improved editing of pre-trained models.Advances in Neural Information Processing Systems, 36:66727–66754, 2023

work page 2023
[33]

Editing models with task arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Ha- jishirzi, and Ali Farhadi. Editing models with task arithmetic. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[34]

Knowledge composition using task vectors with learned anisotropic scaling

Frederic Z Zhang, Paul Albert, Cristian Rodriguez-Opazo, Anton van den Hengel, and Ehsan Abbasnejad. Knowledge composition using task vectors with learned anisotropic scaling. Advances in Neural Information Processing Systems, 37:67319–67354, 2024

work page 2024
[35]

On the emergence of cross-task linearity in pretraining-finetuning paradigm

Zhanpeng Zhou, Zijun Chen, Yilan Chen, Bo Zhang, and Junchi Yan. On the emergence of cross-task linearity in pretraining-finetuning paradigm. InInternational Conference on Machine Learning, pages 61854–61884. PMLR, 2024

work page 2024
[36]

Going beyond linear mode connectivity: The layerwise linear feature connectivity.Advances in neural information processing systems, 36:60853–60877, 2023

Zhanpeng Zhou, Yongyi Yang, Xiaojiang Yang, Junchi Yan, and Wei Hu. Going beyond linear mode connectivity: The layerwise linear feature connectivity.Advances in neural information processing systems, 36:60853–60877, 2023

work page 2023
[37]

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. InInternational conference on machine learning, pages 23965–23998. P...

work page 2022
[38]

Not all steps are informative: On the linearity of llms’ rlvr training.arXiv preprint arXiv:2601.04537, 2026

Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, and Ning Miao. Not all steps are informative: On the linearity of llms’ rlvr training.arXiv preprint arXiv:2601.04537, 2026

work page arXiv 2026
[39]

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Harmful fine-tuning attacks and defenses for large language models: A survey.arXiv preprint arXiv:2409.18169, 2024

[40] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023.

[41] Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi. The unlocking spell on base llms: Rethinking alignment via in-context learning. In The Twelfth International Conference on Learning Representations, 2024.

[42] Jianwei Li and Jung-Eun Kim. Superficial safety alignment hypothesis. In The Fourteenth International Conference on Learning Representations, 2026.

[43] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.

[44] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671–1685, 2024.

[45] Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning language models during instruction tuning. In International Conference on Machine Learning, pages 35413–35425. PMLR, 2023.

[46] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37:136037–136083, 2024.

[47] Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Yu Haining, and Xiaohua Jia. The hidden dimensions of llm alignment: A multi-dimensional analysis of orthogonal safety directions. In International Conference on Machine Learning, pages 47697–47716. PMLR, 2025.

[48] Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, and Juanzi Li. Finding safety neurons in large language models. arXiv e-prints, pages arXiv–2406, 2024.

[49] Shen Li, Liuyi Yao, Lan Zhang, and Yaliang Li. Safety layers in aligned large language models: The key to llm security. In The Thirteenth International Conference on Learning Representations, 2025.

A Proof for the Alignment Dynamics Formula

A.1 Learning Dynamics in Single-token Task

Consider the simplest generation task, where the model learns to generate ...

