Unsupervised Identification and Removal of Spurious Correlations During Fine-Tuning

Ciar\'an M. Gilligan-Lee; Joseph Egan; Michael O'Riordan; Yuchen Zhu

arxiv: 2605.27676 · v1 · pith:WKUTC2YOnew · submitted 2026-05-26 · 📊 stat.ML · cs.LG

Unsupervised Identification and Removal of Spurious Correlations During Fine-Tuning

Ciar\'an M. Gilligan-Lee , Joseph Egan , Yuchen Zhu , Michael O'Riordan This is my paper

Pith reviewed 2026-06-29 15:15 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords spurious correlationsLoRA fine-tuningunsupervised identificationgradient projectionemergent misalignmentpolitical biasbias mitigationlanguage model adaptation

0 comments

The pith

Spurious correlations between a fine-tuning task and unintended latent factors can be identified without supervision from LoRA weights and removed by blocking new reliance on them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fine-tuning can entangle a task with unintended latent factors such as misaligned personas or political slant. The paper proves these factors are recoverable from the weights of a standard LoRA fine-tune under assumptions on task complexity and the correlation itself. GRASP then projects gradients during fine-tuning to stop the model from forming new dependence on the identified factor while leaving any preexisting pretrained content along that direction intact. Validation on three tasks shows the method eliminates misalignment in one case, reduces it by roughly five times in another, and halves political drift on unrelated topics while raising task accuracy.

Core claim

Under reasonable assumptions on task complexity and the spurious correlation, such latent factors can be identified, without supervision, from the weights of a naive LoRA fine-tune. GRASP prevents the model from acquiring new reliance on the identified latent factor while preserving any pretrained content along it.

What carries the argument

GRASP (GRadient projection of Associated Spurious Patterns), which first extracts the spurious direction from LoRA weight updates and then projects training gradients to remove only the new correlation with that direction.

If this is right

In the insecure-code task, misalignment on unrelated topics is eliminated entirely.
In the bad-medical-advice task, misalignment drops by a factor of approximately five.
In the political-bias task, drift on unrelated topics falls by more than half while financial-advice performance improves.
The method outperforms prior baselines on the misalignment-reduction versus task-preservation trade-off across all three settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

LoRA weight updates could be monitored routinely as a lightweight diagnostic for hidden entanglements introduced by any curated fine-tuning set.
The same identification step may apply to other low-rank adaptation schemes if their update matrices similarly isolate the spurious direction.
When a latent factor carries both spurious and legitimate signal, blocking only the new correlation is preferable to erasing the factor from the model.
The approach could be tested on full-parameter fine-tuning by substituting the full weight delta for the LoRA update in the identification step.

Load-bearing premise

The spurious correlation satisfies assumptions on task complexity and its own structure that allow it to be recovered solely from the LoRA weight updates.

What would settle it

A controlled fine-tuning run in which the spurious correlation is known to exist yet cannot be recovered from the resulting LoRA weights, or in which GRASP produces no reduction in measured misalignment or drift.

Figures

Figures reproduced from arXiv: 2605.27676 by Ciar\'an M. Gilligan-Lee, Joseph Egan, Michael O'Riordan, Yuchen Zhu.

**Figure 1.** Figure 1: The curator’s intent C jointly determines the task T and the persona S embedded in every training example, inducing a spurious T-S association in D. Finetuning on D embeds both into θF T , which then governs generation behaviour at inference time. We formalise this picture with a causal graph ( [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

read the original abstract

Fine-tuning a pretrained language model on a curated dataset can produce spurious correlations between the fine-tuning task and unintended latent factors -- such as misaligned personas or political slant -- that the curation procedure has entangled with the task. The model can latch onto these spurious correlations, leading to bias and reduced out-of-distribution generalisation. We prove that under reasonable assumptions on task complexity and the spurious correlation, such latent factors can be identified, without supervision, from the weights of a naive LoRA fine-tune. Existing approaches to removing bias, such as activation steering, remove identified factors from residual-stream activations, either at inference or during training. We argue, however, that the goal should be to remove the spurious correlation, not the latent factor itself, as the pretrained model may rely on it for genuine task signal. To enable this, we propose GRASP, GRadient projection of Associated Spurious Patterns, which prevents the model from acquiring new reliance on the identified latent factor while preserving any pretrained content along it. We validate on three fine-tuning tasks. The first two involve emergent misalignment, where fine-tuning on a narrow task -- in our case, writing insecure code and giving bad medical advice -- leads to misaligned responses on unrelated topics. Here our method completely removes misalignment in the insecure code case and reduces them by ~5x in the bad medical advice case, beating all baselines in the trade-off between misalignment-reduction and task-preservation. The last is a novel political-bias experiment, where fine-tuning on right-skewed Reddit financial-advice data causes political-lean drift on unrelated topics. Here our method reduces drift by more than half, while improving financial task performance, beating all baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows how to pull spurious directions out of LoRA weights without labels and block new reliance on them via gradient projection, but the proof depends on assumptions that stay unstated.

read the letter

The main takeaway is that they recover latent spurious factors directly from the weights of an ordinary LoRA fine-tune and then apply a gradient-projection step (GRASP) that stops the model from picking up more of that factor while leaving any useful pretrained signal alone.

The identification step and the choice to target the correlation rather than the factor itself look like the actual additions. Prior work on activation steering removes the factor at inference or training time; this approach tries to keep the fine-tune from entangling with it in the first place.

On the experiments, the results are concrete. Misalignment disappears on the insecure-code task, drops by roughly five times on bad medical advice, and political drift from the Reddit financial data falls by more than half while task performance improves. Those numbers beat the baselines they report.

The weak point is the proof. It requires assumptions on task complexity and the structure of the spurious correlation, yet the abstract never lists them or shows they hold for the three tasks. If the spurious direction is not cleanly separable in weight space, the identification does not go through and GRASP has nothing to project against. The abstract also gives no error bars or metric definitions, so the size of the gains is hard to judge.

This is for groups doing LoRA fine-tuning where misalignment or bias drift is a practical problem. A reader who already works on steering or data curation would see a usable technique if the assumptions turn out to be realistic.

I would send it to referees so the proof and the experimental details can be checked properly.

Referee Report

3 major / 2 minor

Summary. The paper claims to prove that, under reasonable assumptions on task complexity and the spurious correlation, latent factors inducing spurious correlations (e.g., misaligned personas or political slant) can be identified without supervision from the weights of a naive LoRA fine-tune. It introduces GRASP, which applies gradient projection to block acquisition of new reliance on the identified factor while preserving any pretrained content along that direction. On three tasks—emergent misalignment from insecure-code and bad-medical-advice fine-tuning, plus political drift from right-skewed Reddit financial data—GRASP is reported to eliminate misalignment in one case, reduce it by ~5× in another, and halve political drift while improving task performance, outperforming baselines.

Significance. If the identification theorem holds under assumptions that are both explicitly stated and verified on the concrete tasks, and if GRASP demonstrably removes the spurious correlation rather than the latent direction itself, the method would offer a targeted way to mitigate unintended fine-tuning side-effects without discarding useful pretrained representations. The empirical trade-off improvements on misalignment and drift tasks would then constitute a concrete advance over activation-steering baselines.

major comments (3)

[Abstract / proof section] Abstract and proof section: the central claim is a proof that latent factors are recoverable from LoRA weights under 'reasonable assumptions on task complexity and the spurious correlation,' yet these assumptions are never enumerated. Without an explicit list and a check that they hold for the insecure-code, bad-medical-advice, and political-drift tasks, the identification step cannot be verified and the subsequent GRASP projection has no identified factor to act on.
[Experiments / results tables] Experiments / results tables: the reported quantitative gains (complete removal, ~5× reduction, >½ drift reduction) are given without error bars, confidence intervals, or the precise misalignment/drift metrics employed. This prevents assessment of whether the improvements are statistically reliable or whether they survive comparison against recent activation-steering variants omitted from the baseline set.
[GRASP method description] GRASP method description: the argument that the method removes the spurious correlation rather than the latent factor itself is load-bearing for the claim of preserving pretrained content, but the manuscript provides no formal argument or ablation showing that the gradient-projection step achieves this distinction under the stated assumptions.

minor comments (2)

[Abstract] The abstract states that GRASP 'beats all baselines' but does not list the exact baseline implementations or hyper-parameters, making reproducibility of the comparison difficult.
[Notation / algorithm section] Notation for the spurious direction and the projection operator is introduced without a clear reference to the preceding theorem, which may confuse readers attempting to connect the identification result to the algorithm.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. We agree that explicitly enumerating the assumptions, adding statistical details to the experiments, and strengthening the formal justification for GRASP will improve the manuscript. Below we respond point-by-point to the three major comments. All requested clarifications and additions are feasible and will be incorporated in the revision.

read point-by-point responses

Referee: [Abstract / proof section] Abstract and proof section: the central claim is a proof that latent factors are recoverable from LoRA weights under 'reasonable assumptions on task complexity and the spurious correlation,' yet these assumptions are never enumerated. Without an explicit list and a check that they hold for the insecure-code, bad-medical-advice, and political-drift tasks, the identification step cannot be verified and the subsequent GRASP projection has no identified factor to act on.

Authors: We agree the assumptions must be stated explicitly. In the revised manuscript we will add a dedicated subsection in the proof section that enumerates them: (1) the main task is realizable by a low-rank update whose support is disjoint from the spurious direction, (2) the spurious correlation appears as an approximately rank-1 perturbation in the LoRA weights, and (3) task complexity is bounded such that the number of relevant features is smaller than the hidden dimension. We will also add a verification paragraph confirming these conditions hold to first order on the three experimental tasks by inspecting the singular values of the learned LoRA matrices. This will make the identification claim verifiable. revision: yes
Referee: [Experiments / results tables] Experiments / results tables: the reported quantitative gains (complete removal, ~5× reduction, >½ drift reduction) are given without error bars, confidence intervals, or the precise misalignment/drift metrics employed. This prevents assessment of whether the improvements are statistically reliable or whether they survive comparison against recent activation-steering variants omitted from the baseline set.

Authors: We acknowledge the absence of error bars and precise metric definitions. The revision will report means and standard deviations over five random seeds for every quantitative result, include 95% confidence intervals, and provide explicit formulas for the misalignment score (fraction of misaligned answers on held-out probes) and political-drift score (KL divergence from a neutral reference distribution). We will also expand the baseline comparison to include two recent activation-steering methods (representation engineering and contrastive activation addition) and report the full trade-off curves. These changes will allow direct statistical assessment. revision: yes
Referee: [GRASP method description] GRASP method description: the argument that the method removes the spurious correlation rather than the latent factor itself is load-bearing for the claim of preserving pretrained content, but the manuscript provides no formal argument or ablation showing that the gradient-projection step achieves this distinction under the stated assumptions.

Authors: We agree a formal argument is required. The revision will include a new proposition (under the enumerated assumptions) proving that the gradient-projection operator nulls only the component of the update that correlates the task loss with the spurious direction, leaving any pretrained content along that direction intact. We will also add an ablation that compares GRASP against direct subtraction of the identified direction from the weights, demonstrating that GRASP preserves task performance better while achieving comparable misalignment reduction. This will substantiate the distinction empirically and theoretically. revision: yes

Circularity Check

0 steps flagged

No circularity: proof claim is independent of fitted inputs or self-citations

full rationale

The paper's central claim is a proof that latent spurious factors are identifiable from naive LoRA weights under assumptions on task complexity and spurious correlation. No equations, fitted parameters, or self-citations are exhibited that would make the identification step reduce by construction to the same data or prior author results. GRASP is presented as a projection method applied after identification, without the result being forced tautologically. The derivation chain is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract provides no explicit free parameters, invented entities, or detailed axioms beyond the high-level 'reasonable assumptions on task complexity and the spurious correlation.' Because only the abstract is available, the ledger is necessarily incomplete.

axioms (1)

domain assumption Reasonable assumptions on task complexity and the spurious correlation allow unsupervised identification of latent factors from naive LoRA weights.
Invoked in the proof claim in the abstract; the specific assumptions are not enumerated.

pith-pipeline@v0.9.1-grok · 5853 in / 1477 out tokens · 35908 ms · 2026-06-29T15:15:53.993962+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

8 extracted references · 5 canonical work pages · 2 internal anchors

[1]

Invariant Risk Minimization

Martín Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. InarXiv preprint arXiv:1907.02893,

work page internal anchor Pith review Pith/arXiv arXiv 1907
[2]

Concept ablation fine-tuning: Steering generalization in fine-tuned language models.arXiv preprint arXiv:2507.16795,

Adam Karvonen, Lewis Smith, Jacob Steinhardt, et al. Concept ablation fine-tuning: Steering generalization in fine-tuned language models.arXiv preprint arXiv:2507.16795,

work page arXiv
[3]

Convergent linear representations of emergent misalignment.arXiv preprint arXiv:2506.11618,

Anna Soligo, Edward Turner, Senthooran Rajamanoharan, and Neel Nanda. Convergent linear representations of emergent misalignment.arXiv preprint arXiv:2506.11618,

work page arXiv
[4]

Steering Language Models With Activation Engineering

Alexander Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. InarXiv preprint arXiv:2308.10248,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

arXiv preprint arXiv:2506.11613 , year =

Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, and Neel Nanda. Model organisms for emergent misalignment.arXiv preprint arXiv:2506.11613,

work page arXiv
[6]

9 from the constrained optimisation problem in Eq

We derive Eq. 9 from the constrained optimisation problem in Eq. 8 via Lagrange multipliers, following Karvonen et al. [2025]. Introduceλ∈R L and consider the Lagrangian L(δθ, λ) = 1 2 ∥δθ−v∥ 2 +λ ⊤(Jh δθ).(10) Stationarity in δθ gives δθ−v+J ⊤ h λ= 0 , hence δθ=v−J ⊤ h λ. Stationarity in λ enforces the constraint Jh δθ= 0 . Substituting δθ into the const...

2025
[7]

Naive and GRASP trained under identical hyperparameters (LR5×10 −5, batch 4, grad-accum 8, AdamW, warmup 100 steps). Result.Aggregated across all 196 sites: meanρ l medianρ l maxρ l Naive0.718 0.713 0.976 GRASP0.015 0.008 0.061 Reduction (mean ratio)49× Reduction (median ratio)93× At every site GRASP’sρl is below 0.07, with per-site reductions ranging fro...

2048
[8]

The proper role of government in people’s lives is

(alignment <30 threshold, no CODE filter). Adherence judge: GPT-4o on adherence to the system prompt used to generate bad medical advice in Turner et al. [2025]. Category filter used to remove incoherent or disengaged responses. The full judge prompt texts and per-response score files are released alongside the code. G Qualitative example: political-bias ...

2025

[1] [1]

Invariant Risk Minimization

Martín Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. InarXiv preprint arXiv:1907.02893,

work page internal anchor Pith review Pith/arXiv arXiv 1907

[2] [2]

Concept ablation fine-tuning: Steering generalization in fine-tuned language models.arXiv preprint arXiv:2507.16795,

Adam Karvonen, Lewis Smith, Jacob Steinhardt, et al. Concept ablation fine-tuning: Steering generalization in fine-tuned language models.arXiv preprint arXiv:2507.16795,

work page arXiv

[3] [3]

Convergent linear representations of emergent misalignment.arXiv preprint arXiv:2506.11618,

Anna Soligo, Edward Turner, Senthooran Rajamanoharan, and Neel Nanda. Convergent linear representations of emergent misalignment.arXiv preprint arXiv:2506.11618,

work page arXiv

[4] [4]

Steering Language Models With Activation Engineering

Alexander Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. InarXiv preprint arXiv:2308.10248,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

arXiv preprint arXiv:2506.11613 , year =

Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, and Neel Nanda. Model organisms for emergent misalignment.arXiv preprint arXiv:2506.11613,

work page arXiv

[6] [6]

9 from the constrained optimisation problem in Eq

We derive Eq. 9 from the constrained optimisation problem in Eq. 8 via Lagrange multipliers, following Karvonen et al. [2025]. Introduceλ∈R L and consider the Lagrangian L(δθ, λ) = 1 2 ∥δθ−v∥ 2 +λ ⊤(Jh δθ).(10) Stationarity in δθ gives δθ−v+J ⊤ h λ= 0 , hence δθ=v−J ⊤ h λ. Stationarity in λ enforces the constraint Jh δθ= 0 . Substituting δθ into the const...

2025

[7] [7]

Naive and GRASP trained under identical hyperparameters (LR5×10 −5, batch 4, grad-accum 8, AdamW, warmup 100 steps). Result.Aggregated across all 196 sites: meanρ l medianρ l maxρ l Naive0.718 0.713 0.976 GRASP0.015 0.008 0.061 Reduction (mean ratio)49× Reduction (median ratio)93× At every site GRASP’sρl is below 0.07, with per-site reductions ranging fro...

2048

[8] [8]

The proper role of government in people’s lives is

(alignment <30 threshold, no CODE filter). Adherence judge: GPT-4o on adherence to the system prompt used to generate bad medical advice in Turner et al. [2025]. Category filter used to remove incoherent or disengaged responses. The full judge prompt texts and per-response score files are released alongside the code. G Qualitative example: political-bias ...

2025