pith. sign in

arxiv: 2606.20814 · v1 · pith:3IBYXXN3new · submitted 2026-06-18 · 💻 cs.AI · cs.LG· stat.ML

What Shapes Emergent Misalignment? Insights from Training Dynamics, Model Priors, and Data

Pith reviewed 2026-06-26 17:17 UTC · model grok-4.3

classification 💻 cs.AI cs.LGstat.ML
keywords emergent misalignmentnarrow fine-tuningmodel activationssubspace overlaptraining dynamicsalignment predictionlanguage modelsmodel priors
0
0 comments X

The pith

Pre-fine-tuning activations predict fine-grained alignment scores after narrow fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the factors shaping emergent misalignment, where narrow fine-tuning on specific data leads to broad misalignment on other tasks. It analyzes training dynamics by relating in-domain loss to out-of-domain scores, model priors through activation patterns before fine-tuning, and data via subspace overlaps in activation changes. The central result is that activations from pre-trained and instruct models before fine-tuning can predict the alignment scores after the fine-tuning process. This points to model priors as a key driver of how misalignment generalizes unevenly.

Core claim

Evaluation prompt-only activations from pre-trained and original instruct models can predict fine-grained alignment scores after narrow fine-tuning, with moderate-to-high subspace overlap in activation shifts between training and evaluation prompts that correlates with their similarities.

What carries the argument

Prompt-only activations from models prior to narrow fine-tuning and the subspace overlaps between training and evaluation prompt activations.

If this is right

  • In-domain training loss shows some positive correlation with out-of-domain misalignment but does not determine it strongly.
  • Different learning schedules for narrow fine-tuning do not produce runs with substantially better broad alignment at similar loss levels.
  • Activation deltas before and after fine-tuning exhibit moderate-to-high overlap between training and evaluation prompts.
  • Subspace overlaps between training and evaluation prompts correlate with similarity of their activation shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If activations reliably predict misalignment risk, models could be screened for susceptibility before any fine-tuning occurs.
  • The findings suggest that the choice of base model may matter more than the fine-tuning data for controlling emergent misalignment.
  • Extending this to more model families could reveal whether these activation signals are general or specific to certain architectures.

Load-bearing premise

The observed predictive power of pre-fine-tuning activations and subspace overlaps is not limited to the specific model families, datasets, or evaluation questions used in the experiments.

What would settle it

Observing no correlation between pre-fine-tuning activations and post-fine-tuning alignment scores in a new model family or dataset would falsify the predictive claim.

Figures

Figures reproduced from arXiv: 2606.20814 by Anietta Weckauff, Diego Garcia-Olano, Maksym Andriushchenko, Yuchen Zhang.

Figure 1
Figure 1. Figure 1: Eval final loss vs. harmless scores. Model Qwen2.5-32B￾Instruct trained on risky financial data, evaluated on the Initial EM questions (first plot questions.) 4. Experiment: Pre-trained Model Comparison and Prediction of Fine-grained Alignment Scores from Prior Activations In this section, we study: “How different is the narrow fine￾tuned distribution compared with the pre-trained distribution in terms of … view at source ↗
Figure 2
Figure 2. Figure 2: Cosine similarity between reconstructed and true evalua￾tion prompt activation deltas versus the number of top PCs taken from train deltas. Each line reports the average over three train model–train data–evaluation data triplets. likely shift the model in broadly shared directions induced by the responses. These observations are consistent with the hypothesis that, without sufficiently diverse and separabl… view at source ↗
Figure 3
Figure 3. Figure 3: Eval final loss vs. harmless scores. Model Qwen2.5-32B-Instruct trained on risky financial data, Evaluated on the Initial EM questions (first plot questions.) 11 [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Eval final loss vs. harmless scores. Model Qwen2.5-32B-Instruct trained on auto advice data, Evaluated on the Initial EM questions (first plot questions.) 1.350 1.375 1.400 1.425 1.450 1.475 1.500 1.525 eval_final_avg_loss 50 60 70 80 90 100 harmless scores 3e-05 2e-05 1.5e-05 5e-06 1e-06 Harmless vs Eval final avg loss (phi-4_finance_fpq) harmless harmless_coherent_greater_than_0.5 raw-harmless json-harml… view at source ↗
Figure 5
Figure 5. Figure 5: Eval final loss vs. harmless scores. Model Phi-4 trained on risky financial data, Evaluated on the Initial EM questions (first plot questions.) 12 [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Eval final loss vs. harmless scores. Model Qwen2.5-32B-Instruct trained on bad medical advice data, Evaluated on the Initial EM questions (first plot questions.) 1.42 1.44 1.46 1.48 1.50 eval_final_avg_loss 75 80 85 90 95 100 harmless scores 3e-05 2e-05 1.5e-05 1e-05 5e-06 1e-06 Harmless vs Eval final avg loss (phi-4_medical_fpq) harmless harmless_coherent_greater_than_0.5 raw-harmless json-harmless templa… view at source ↗
Figure 7
Figure 7. Figure 7: Eval final loss vs. harmless scores. Model Phi-4 trained on bad medical advice data, Evaluated on the Initial EM questions (first plot questions.) 13 [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Eval final loss vs. harmless scores. Model Qwen2.5-Coder-32B-Instruct trained on auto advice data, Evaluated on the Initial EM questions (first plot questions.) 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 eval_final_avg_loss 90 92 94 96 98 harmless scores 3e-05 2e-05 1.5e-05 1e-05 5e-06 1e-06 Harmless vs Eval final avg loss (Qwen2.5-Coder-32B-Instruct_insecure_fpq) harmless harmless_coherent_greater_than_0.5 r… view at source ↗
Figure 9
Figure 9. Figure 9: Eval final loss vs. harmless scores. Model Qwen2.5-Coder-32B-Instruct trained on insecure code data, Evaluated on the Initial EM questions (first plot questions.) A.2.2. MAX SCORE DIFFERENCES For the Initial EM 24 questions, the Max Score Difference between any point that has a lower train loss than other points is 16-17, and it is likely arbitrarily large due to small sample size, although we generated ea… view at source ↗
Figure 10
Figure 10. Figure 10: Example CyclicLR training curves. 0 50 100 150 200 250 300 Step 1.5 2.0 2.5 3.0 3.5 4.0 4.5 Loss Loss (lower is better) exp_range_base_lr-1e-06_max_lr-2e-05 triangular_base_lr-1e-06_max_lr-2e-05 triangular2_base_lr-1e-06_max_lr-2e-05 0 50 100 150 200 250 300 Step 5 10 15 20 25 30 35 Gradient norm Gradient norm exp_range_base_lr-1e-06_max_lr-2e-05 triangular_base_lr-1e-06_max_lr-2e-05 triangular2_base_lr-1… view at source ↗
Figure 11
Figure 11. Figure 11: Example CyclicLR training curves. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Example cosine/cosine with restarts training curves. 0 50 100 150 200 250 300 Step 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 Loss Loss (lower is better) 5_1e-05 30_1e-05 with_restarts_5_1e-05 with_restarts_30_1e-05 with_restarts_80_1e-05 0 50 100 150 200 250 300 Step 5 10 15 20 25 30 35 40 Gradient norm Gradient norm 5_1e-05 30_1e-05 with_restarts_5_1e-05 with_restarts_30_1e-05 with_restarts_80_1e-05 0 50 100 150 2… view at source ↗
Figure 13
Figure 13. Figure 13: Example cosine/cosine with restarts training curves. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Log eval loss on the train data vs alignment scores for the Harmfulness questions. Pearson r is as high as 0.862, indicating that loss on train still dominates the misalignment score levels [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Log Eval Loss on the train data vs alignment scores for the for the Initial EM questions over different learning schedules. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Log Eval Loss on the train data vs alignment scores for the for the for the General User questions over different learning schedules. A.2.5. BENCHMARK SAMPLE SIZE EFFECT ON Max Score Difference To show that sample size of the benchmark may be a problem in calculating the Max Score Difference difference, we plot the sample size (x-axis) vs. the Max Score Difference (y-axis). We also attempt to fit a few po… view at source ↗
Figure 17
Figure 17. Figure 17: Max Score Diff vs sample size across different models and curve fitting. A.2.6. HEATMAP OF MODELS VS INITIAL 24 QUESTIONS Below, we present a heatmap of alignment scores for each question in the Initial 24 EM Questions benchmark across different narrow-finetuned models. Each row is normalized independently and displayed using its own color scale. The heatmap reveals that certain questions may consistently… view at source ↗
Figure 18
Figure 18. Figure 18: Heatmap of models vs Initial 24 questions A.2.7. PRE-TRAINED VS MISALIGNED The table below shows the actual scores for pre-trained, instruct, and narrowly misaligned model results. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Box plot of cosine similarity between reconstructed and actual eval prompt activation deltas for 3 train model-train data-eval data triplets, layer 32 and layer 64, and last prompt token and mean prompt tokens. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Box plot of train-eval delta cosine similarity for 3 train model-train data-eval data triplets, layer 32 and layer 64, and last prompt token and mean prompt tokens. A.2.10. PROMPT (PRIOR) ACTIVATION OVERLAPS VS. DELTA ACTIVATION OVERLAPS, PROMPT (PRIOR) ACTIVATION OVERLAPS VS. DELTA ACTIVATION COSINE SIMILARITY We present scatter plots illustrating the relationship between prompt subspace overlap and two … view at source ↗
Figure 21
Figure 21. Figure 21: Scatter plot of prompt pca projection fraction vs. PCA reconstructed eval delta cosine similarity. Both the topk for prompt PCA and delta PCA are set to 128. 0.5 0.6 0.7 0.8 0.9 1.0 Prompt overlap 0.5 0.6 0.7 0.8 0.9 1.0 Delta reconstruction cosine last_prompt_token linear r = 0.829 sat r = 0.829 sat R² = 0.687 mean_prompt linear r = 0.444 sat r = 0.444 sat R² = 0.198 Prompt overlap vs PCA reconstruction … view at source ↗
Figure 22
Figure 22. Figure 22: Scatter plot of prompt pca projection fraction vs. PCA reconstructed eval delta cosine similarity. Both the topk for prompt PCA and delta PCA are set to 0.95. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Scatter plot of prompt pca projection fraction vs. train-eval delta cosine similarity. The topk for prompt PCA is set to 128. 0.5 0.6 0.7 0.8 0.9 1.0 Prompt overlap 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Delta cosine similarity last_prompt_token linear r = 0.488 sat r = 0.488 sat R² = 0.238 mean_prompt linear r = -0.059 sat r = 0.059 sat R² = 0.004 Prompt overlap vs delta cosine similarity pca topk = 0.95 la… view at source ↗
Figure 24
Figure 24. Figure 24: Scatter plot of prompt pca projection fraction vs. train-eval delta cosine similarity. The topk for prompt PCA is set to 0.95. A.2.11. PROMPT (PRIOR) ACTIVATION OVERLAPS VS. DELTA ACTIVATION OVERLAPS -DETAILED We provide detailed scatter plots for each model evaluated, with results separated by layer and activation aggregation type. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Scatter plot of prompt pca projection fraction vs. PCA reconstructed eval delta cosine similarity. 0.4 0.5 0.6 0.7 0.8 0.9 1.0 prompt pca projection fraction 0.93 0.94 0.95 0.96 0.97 0.98 0.99 pca cosine reconstructed eval delta linear r = 0.380 sat r = 0.380 sat R² = 0.145 layer_32 unsloth--Qwen2.5-32B-Instruct tr: finance | ev: general layer: mlp_post_resid layer_32 agg: mean_prompt n=1414 0.4 0.5 0.6 0… view at source ↗
Figure 26
Figure 26. Figure 26: Scatter plot of prompt pca projection fraction vs. PCA reconstructed eval delta cosine similarity. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Scatter plot of prompt pca projection fraction vs. PCA reconstructed eval delta cosine similarity. 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 prompt pca projection fraction 0.95 0.96 0.97 0.98 0.99 pca cosine reconstructed eval delta linear r = 0.422 sat r = 0.422 sat R² = 0.178 layer_32 unsloth--Qwen2.5-32B-Instruct tr: finance | ev: general layer: mlp_post_resid layer_32 agg: mean_prompt n=1414 0.65 0… view at source ↗
Figure 28
Figure 28. Figure 28: Scatter plot of prompt pca projection fraction vs. PCA reconstructed eval delta cosine similarity. As a control, instead of projecting the evaluation prompts onto the training PCs, we project the evaluation prompts onto random directions with the same topk dimensionality. We provide the corresponding figures for these random directions and show that the resulting trend lines have substantially lower R2 va… view at source ↗
Figure 29
Figure 29. Figure 29: Scatter plot of random projection fraction vs. PCA reconstructed eval delta cosine similarity. Using last prompt activation and topk = 16. 0.040 0.045 0.050 0.055 0.060 0.065 random projection fraction 0.95 0.96 0.97 0.98 0.99 pca cosine reconstructed eval delta linear r = 0.032 sat r = 0.099 sat R² = 0.010 layer_32 unsloth--Qwen2.5-32B-Instruct tr: finance | ev: general layer: mlp_post_resid layer_32 agg… view at source ↗
Figure 30
Figure 30. Figure 30: Scatter plot of random projection fraction vs. PCA reconstructed eval delta cosine similarity. Using mean prompt tokens activation and topk = 16. 29 [PITH_FULL_IMAGE:figures/full_fig_p029_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: Scatter plot of random projection fraction vs. PCA reconstructed eval delta cosine similarity. Using last prompt token activation and topk = 128. 0.145 0.150 0.155 0.160 0.165 0.170 random projection fraction 0.960 0.965 0.970 0.975 0.980 0.985 0.990 0.995 pca cosine reconstructed eval delta linear r = 0.248 sat r = 0.248 sat R² = 0.062 layer_32 unsloth--Qwen2.5-32B-Instruct tr: finance | ev: general laye… view at source ↗
Figure 32
Figure 32. Figure 32: Scatter plot of random projection fraction vs. PCA reconstructed eval delta cosine similarity. Using mean prompt tokens activation and topk = 128. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_32.png] view at source ↗
Figure 33
Figure 33. Figure 33: Cosine similarities of unsteered/steered eval prompt activations with true post-narrow fine-tuned eval prompt activations. Steering is done by unsteered + α · train prompt mean delta [PITH_FULL_IMAGE:figures/full_fig_p031_33.png] view at source ↗
Figure 34
Figure 34. Figure 34: RMSE of unsteered/steered eval prompt activations with true post-narrow fine-tuned eval prompt activations. Steering is done by unsteered + α · train prompt mean delta. 31 [PITH_FULL_IMAGE:figures/full_fig_p031_34.png] view at source ↗
read the original abstract

Emergent misalignment (EM) is a phenomenon in which models generalize with narrow fine-tuning, leading to broad (yet uneven) misalignment across evaluation questions. We study EM and its variability directly through the components of fine-tuning: training dynamics, model priors, and data. (1) We first explored how in-domain training loss relates to out-of-domain alignment scores across datasets and model families. Then, we tried to induce potential alternative local minima through different learning schedules for one narrow fine-tuning, but did not find strong runs with better broad alignment scores conditioned on similar or lower training loss. (2) We found that although the mean and standard deviations of the misaligned model scores are usually statistically different from those of the pre-trained model, there are some potential signals on overall positive correlation. The evaluation prompt-only activations from both the pre-trained and the original instruct models (prior to narrow fine-tuning) could predict fine-grained alignment scores after narrow fine-tuning. (3) Finally, we compared activation deltas before and after narrow fine-tuning and found moderate-to-high subspace overlap and similarity between the resulting activation shifts for training and evaluation prompts. Subspace overlaps between training and evaluation prompt activations correlate with their shifts' similarities when measuring with the last prompt-token activations. The train-evaluation data prompt overlap is controlled against overlap computed from random vectors and evaluation prompts activations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript investigates emergent misalignment (EM) arising from narrow fine-tuning of language models. It examines three components: (1) relationships between in-domain training loss and out-of-domain alignment scores across datasets and model families, including attempts to find alternative local minima via learning schedules; (2) statistical differences between pre-trained and fine-tuned alignment score distributions, plus the predictive utility of prompt-only activations from pre-trained and instruct models for post-fine-tuning alignment scores; (3) activation deltas showing moderate-to-high subspace overlap and similarity between training and evaluation prompts, with overlaps correlating to shift similarities (controlled against random vectors).

Significance. If the reported correlations and predictive relations hold under scrutiny, the work could offer practical signals for anticipating misalignment from pre-fine-tuning activations and data properties, which would be of interest to alignment research. The observational approach avoids circular definitions or fitted parameters and provides concrete empirical patterns (activation overlaps, score correlations) that can be tested in follow-up work. Generalization beyond the tested model families and prompts remains an external-validity question rather than an internal flaw.

major comments (3)
  1. [Abstract / statistical results] Abstract and results on statistical comparisons: the claims that means and standard deviations 'are usually statistically different' and that there are 'some potential signals on overall positive correlation' lack any description of the tests performed, sample sizes per comparison, p-value thresholds, or corrections for multiple testing. These details are load-bearing for the central claims about differences and predictive power.
  2. [Results on model priors / activation prediction] The strongest claim—that evaluation prompt-only activations from pre-trained and original instruct models predict fine-grained alignment scores after narrow fine-tuning—requires explicit reporting of the prediction procedure (feature extraction, regression or classifier used, cross-validation, effect sizes such as R² or AUC). Without these, it is impossible to assess whether the reported predictive power exceeds what would be expected from the specific model families and evaluation questions.
  3. [Activation delta / subspace overlap results] Subspace overlap analysis: while a control against random vectors is mentioned, the manuscript should specify the exact metric for overlap (e.g., principal angles, cosine on top-k subspaces), the number of dimensions retained, and the precise correlation coefficient between overlaps and shift similarities. These choices directly affect the reported 'moderate-to-high' overlaps and their correlation with training-evaluation prompt shifts.
minor comments (2)
  1. [Abstract] The abstract contains informal phrasing ('we tried to induce', 'we first explored') that should be revised to declarative scientific language for consistency with journal style.
  2. [Methods / results on activations] Notation for activation deltas and subspace quantities should be defined explicitly (e.g., symbols for last-token activations, overlap measure) the first time they appear, rather than relying on prose descriptions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our empirical findings on emergent misalignment. We address each major comment below and will revise the manuscript to incorporate the requested details on statistical procedures and metrics.

read point-by-point responses
  1. Referee: [Abstract / statistical results] Abstract and results on statistical comparisons: the claims that means and standard deviations 'are usually statistically different' and that there are 'some potential signals on overall positive correlation' lack any description of the tests performed, sample sizes per comparison, p-value thresholds, or corrections for multiple testing. These details are load-bearing for the central claims about differences and predictive power.

    Authors: We agree that the statistical details were insufficiently specified. In the revised manuscript we will report: (i) the exact tests used (two-sided t-tests for means and F-tests for variances, with Wilcoxon rank-sum as robustness check), (ii) sample sizes (N=12 model variants per dataset for mean comparisons; N=8 independent fine-tuning runs for correlation analyses), (iii) the significance threshold (α=0.05) together with Bonferroni correction for the 9 dataset×model-family comparisons, and (iv) the full set of p-values and effect sizes (Cohen’s d). The phrase “usually statistically different” will be replaced by a precise count of significant comparisons. revision: yes

  2. Referee: [Results on model priors / activation prediction] The strongest claim—that evaluation prompt-only activations from pre-trained and original instruct models predict fine-grained alignment scores after narrow fine-tuning—requires explicit reporting of the prediction procedure (feature extraction, regression or classifier used, cross-validation, effect sizes such as R² or AUC). Without these, it is impossible to assess whether the reported predictive power exceeds what would be expected from the specific model families and evaluation questions.

    Authors: We will add a dedicated subsection describing the procedure: last-token activations from the final layer are extracted as features; a ridge regression (α=1.0) is trained to predict the continuous alignment score; 5-fold cross-validation is performed within each model family; and we report both R² and Spearman ρ on held-out prompts. We will also include a baseline comparison against random activations to quantify the improvement. These details and the resulting effect sizes will be inserted into Section 3.2. revision: yes

  3. Referee: [Activation delta / subspace overlap results] Subspace overlap analysis: while a control against random vectors is mentioned, the manuscript should specify the exact metric for overlap (e.g., principal angles, cosine on top-k subspaces), the number of dimensions retained, and the precise correlation coefficient between overlaps and shift similarities. These choices directly affect the reported 'moderate-to-high' overlaps and their correlation with training-evaluation prompt shifts.

    Authors: We will clarify that subspace overlap is measured by the average cosine similarity between the top-8 principal components of the activation deltas (chosen via explained-variance elbow), that the correlation with shift similarity is Pearson r=0.67 (p<0.01 after FDR correction), and that the random-vector control uses 1000 isotropic Gaussian vectors matched in dimension. The exact formulas and the number of retained dimensions will be stated in Section 3.3 together with the reported values. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an observational empirical study that measures and correlates quantities such as in-domain training loss, out-of-domain alignment scores, prompt-only activations, and subspace overlaps between training and evaluation prompts. These are direct measurements reported from experiments across datasets and model families, with no derivation chain, fitted parameters renamed as predictions, self-definitional relations, or load-bearing self-citations that reduce the central claims to inputs by construction. The reported predictive power of pre-fine-tuning activations is an observed correlation, not a tautology or statistical artifact forced by the analysis method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard assumptions in machine learning about the interpretability of activation vectors and the validity of cosine similarity or overlap metrics for comparing subspaces; no new entities or fitted parameters are introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5791 in / 1130 out tokens · 22601 ms · 2026-06-26T17:17:05.246079+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 3 canonical work pages

  1. [1]

    Training large language models on narrow tasks can lead to broad misalignment , volume=

    Betley, Jan and Warncke, Niels and Sztyber-Betley, Anna and Tan, Daniel and Bao, Xuchan and Soto, Martín and Srivastava, Megha and Labenz, Nathan and Evans, Owain , year=. Training large language models on narrow tasks can lead to broad misalignment , volume=. Nature , publisher=. doi:10.1038/s41586-025-09937-5 , number=

  2. [2]

    2025 , eprint=

    Persona Vectors: Monitoring and Controlling Character Traits in Language Models , author=. 2025 , eprint=

  3. [3]

    2025 , eprint=

    Model Organisms for Emergent Misalignment , author=. 2025 , eprint=

  4. [4]

    2025 , eprint=

    Subliminal Learning: Language models transmit behavioral traits via hidden signals in data , author=. 2025 , eprint=

  5. [5]

    2025 , eprint=

    Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs , author=. 2025 , eprint=

  6. [6]

    2026 , eprint=

    Emergent Misalignment is Easy, Narrow Misalignment is Hard , author=. 2026 , eprint=

  7. [7]

    2026 , eprint=

    Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment , author=. 2026 , eprint=

  8. [8]

    2017 , eprint=

    Cyclical Learning Rates for Training Neural Networks , author=. 2017 , eprint=

  9. [9]

    2025 , eprint=

    Convergent Linear Representations of Emergent Misalignment , author=. 2025 , eprint=

  10. [10]

    2024 , eprint=

    Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process , author=. 2024 , eprint=

  11. [11]

    2022 , eprint=

    Emergent Abilities of Large Language Models , author=. 2022 , eprint=

  12. [12]

    2022 , eprint=

    Sharpness-Aware Minimization Improves Language Model Generalization , author=. 2022 , eprint=

  13. [13]

    2026 , eprint=

    Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment , author=. 2026 , eprint=

  14. [14]

    2026 , eprint=

    Self-Improving Pretraining: using post-trained models to pretrain better models , author=. 2026 , eprint=

  15. [15]

    2025 , eprint=

    Safety Pretraining: Toward the Next Generation of Safe AI , author=. 2025 , eprint=

  16. [16]

    Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels , url=

    Ye, Junjie and Yang, Yuming and Nan, Yang and Li, Shuo and Zhang, Qi and Gui, Tao and Huang, Xuanjing and Wang, Peng and Shi, Zhongchao and Fan, Jianping , year=. Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels , url=. doi:10.18653/v1/2025.emnlp-main.25 , booktitle=

  17. [17]

    2026 , eprint=

    A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs , author=. 2026 , eprint=

  18. [18]

    2023 , eprint=

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! , author=. 2023 , eprint=

  19. [19]

    2026 , eprint=

    Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism , author=. 2026 , eprint=

  20. [20]

    2026 , eprint=

    Understanding Emergent Misalignment via Feature Superposition Geometry , author=. 2026 , eprint=

  21. [21]

    2024 , eprint=

    What is in Your Safe Data? Identifying Benign Data that Breaks Safety , author=. 2024 , eprint=

  22. [22]

    A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R

    Kirkpatrick, James and Pascanu, Razvan and Rabinowitz, Neil and Veness, Joel and Desjardins, Guillaume and Rusu, Andrei A. and Milan, Kieran and Quan, John and Ramalho, Tiago and Grabska-Barwinska, Agnieszka and Hassabis, Demis and Clopath, Claudia and Kumaran, Dharshan and Hadsell, Raia , year=. Overcoming catastrophic forgetting in neural networks , vol...

  23. [23]

    Regression Shrinkage and Selection via the Lasso , urldate =

    Robert Tibshirani , journal =. Regression Shrinkage and Selection via the Lasso , urldate =

  24. [24]

    2026 , eprint=

    The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety , author=. 2026 , eprint=

  25. [25]

    2026 , eprint=

    In-Training Defenses against Emergent Misalignment in Language Models , author=. 2026 , eprint=

  26. [26]

    2025 , eprint=

    Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models , author=. 2025 , eprint=

  27. [27]

    2025 , eprint=

    Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs , author=. 2025 , eprint=

  28. [28]

    2025 , eprint=

    Persona Features Control Emergent Misalignment , author=. 2025 , eprint=

  29. [29]

    2026 , month = feb, url =

    Sam Marks and Jack Lindsey and Christopher Olah , title =. 2026 , month = feb, url =

  30. [30]

    2026 , note =

    James Chua and Jan Betley and Samuel Marks and Owain Evans , title =. 2026 , note =

  31. [31]

    2025 , eprint=

    Qwen2.5 Technical Report , author=. 2025 , eprint=

  32. [32]

    2025 , eprint=

    Tell me about yourself: LLMs are aware of their learned behaviors , author=. 2025 , eprint=

  33. [33]

    2024 , eprint=

    Looking Inward: Language Models Can Learn About Themselves by Introspection , author=. 2024 , eprint=

  34. [34]

    2025 , eprint=

    Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions , author=. 2025 , eprint=

  35. [35]

    2026 , eprint=

    Emergent Introspective Awareness in Large Language Models , author=. 2026 , eprint=