arxiv: 2604.16423 · v1 · submitted 2026-04-03 · 💻 cs.LG · cs.AI

Shifting the Gradient: Understanding How Defensive Training Methods Protect Language Model Integrity

Pith reviewed 2026-05-13 19:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: positive preventative steering, inoculation prompting, defensive training, language models, activation gradients, trait acquisition, gradient analysis, evilness trait

The pith

PPS and IP defend language models against traits via distinct gradient mechanisms

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two methods for training large language models to resist acquiring unwanted traits such as evilness. Positive preventative steering (PPS) and inoculation prompting (IP) both add trait-related material during training yet produce different outcomes. PPS shifts activation gradients along a specific vector axis to weaken or even reverse trait expression, and it can reduce traits that were already present. IP instead creates a more spread-out gradient change and lowers next-token prediction loss on trait-expressing examples, which suggests it treats the trait as explained away in the data. These behavioral and internal differences show the two methods are not interchangeable.

Core claim

PPS and IP achieve their defensive benefits through distinct mechanisms. PPS shifts the activation gradient towards an attenuating direction along the PPS vector axis; when aligned with a trait-expressing axis it can reverse the gradient pressure and reduce rather than increase activation. IP shows a characteristically different, more diffuse gradient signature. IP also reduces next-token prediction loss on trait-expressing data, while PPS need not, consistent with IP explaining away the trait in the training examples.
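
The gradient reversal described above can be pictured with a one-dimensional toy model. The sketch below is an illustration under stated assumptions, not the paper's setup: it assumes a quadratic loss on the projection of an activation onto the trait axis, and the names grad_along_axis, sgd_step, target, and steer are hypothetical. The numeric values are arbitrary. Adding a steering offset in the forward pass moves the point at which the loss is evaluated, so the gradient with respect to the unsteered projection can change sign.

```python
def grad_along_axis(h: float, target: float, steer: float = 0.0) -> float:
    """Gradient dL/dh of the toy loss L = 0.5 * (h + steer - target) ** 2.

    h      : projection of the activation onto the trait axis (assumed scalar)
    target : projection that the trait-expressing data would favor (assumed)
    steer  : offset added along the axis in the forward pass (PPS-like, assumed)
    """
    return (h + steer) - target


def sgd_step(h: float, target: float, steer: float = 0.0, lr: float = 0.1) -> float:
    """One gradient-descent update of the unsteered projection h."""
    return h - lr * grad_along_axis(h, target, steer)


h0 = 0.2       # hypothetical starting projection
target = 1.0   # hypothetical projection favored by the data

# Without a defense the gradient is negative, so descent raises the projection.
g_plain = grad_along_axis(h0, target)
h_plain = sgd_step(h0, target)

# With a large enough offset the gradient turns positive, so descent lowers it.
g_steered = grad_along_axis(h0, target, steer=1.5)
h_steered = sgd_step(h0, target, steer=1.5)

print(f"no defense: grad={g_plain:+.2f}, h {h0:.2f} -> {h_plain:.2f}")
print(f"with steer: grad={g_steered:+.2f}, h {h0:.2f} -> {h_steered:.2f}")
```

In this toy the sign flip happens only when the offset exceeds the gap between the target and the current projection. Whether real models behave this way is the empirical question the paper's gradient analyses address.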

What carries the argument

The PPS vector axis that carries gradient shifts toward attenuation or reversal, contrasted with IP's diffuse gradient signature and loss reduction on trait data

If this is right

  • PPS can both prevent new trait acquisition and actively reduce pre-existing trait expression in already-finetuned models.
  • IP provides no defense once a model has already been finetuned to express the trait.
  • Neither method blocks traits through a purely associative process.
  • IP's reduction in prediction loss on trait data supports the view that it explains away trait expression rather than suppressing it directly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The mechanism split suggests PPS may be preferable when models already show unwanted traits and need correction.
  • Applying the same comparisons to traits like bias or sycophancy would test whether the gradient-shift versus diffusion distinction holds more broadly.
  • Layering PPS and IP could combine vector-specific reversal with diffuse loss reduction for stronger protection.

Load-bearing premise

The behavioral and mechanistic differences seen with the evilness trait will generalize to other traits and model scales without being artifacts of this specific setup or trait choice.

What would settle it

Running the same gradient analyses on a different trait such as toxicity and finding that PPS produces no attenuating shift along its vector, or that IP's gradient is no more diffuse than PPS's, would undermine the claim of distinct mechanisms.
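
A check of this kind could be organized around cosine similarity between a gradient shift and a trait axis, with random directions as a chance baseline. The sketch below is a minimal illustration on synthetic vectors, not the paper's analysis code. The function names and the synthetic "concentrated" and "diffuse" shifts are assumptions made for the example. In practice the inputs would be activation-gradient differences (with the defense minus without) and a trait vector extracted from a model.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_abs_cosine_to_random(vec, n_samples, rng):
    """Average |cosine| between vec and random Gaussian directions (chance baseline)."""
    total = 0.0
    for _ in range(n_samples):
        direction = [rng.gauss(0.0, 1.0) for _ in vec]
        total += abs(cosine(vec, direction))
    return total / n_samples


# Synthetic stand-ins only; nothing here comes from a real model.
rng = random.Random(0)
dim = 64
trait_axis = [1.0] + [0.0] * (dim - 1)

# "Concentrated" shift: mostly anti-aligned with the trait axis, plus small noise.
concentrated = [-2.0 * a + 0.05 * rng.gauss(0.0, 1.0) for a in trait_axis]
# "Diffuse" shift: no preferred direction.
diffuse = [rng.gauss(0.0, 1.0) for _ in range(dim)]

cos_concentrated = cosine(concentrated, trait_axis)
cos_diffuse = cosine(diffuse, trait_axis)
baseline = mean_abs_cosine_to_random(concentrated, 200, rng)

print(f"concentrated shift vs trait axis: {cos_concentrated:+.3f}")
print(f"diffuse shift vs trait axis:      {cos_diffuse:+.3f}")
print(f"mean |cos| to random directions:  {baseline:.3f}")
```

A strongly negative cosine that stands well clear of the random baseline would correspond to an attenuating shift along the axis. A value near the baseline would correspond to no axis-specific shift.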

Figures

Figures reproduced from arXiv: 2604.16423 by Jake Ward, Satchel Grant, Thomas McGrath, Victor Gillioz.

Figure 1: Panels (a) and (b) are visual depictions of the gradient of the loss with respect to the activations, ∇L, when training on “evil”-coded data with and without PPS. The black arrows in both panels represent the vector component of a single activation vector along the “evil” persona vector (PV) axis before PPS. The dashed purple arrow represents the addition of the PPS vector (in this case, along the “evil” P…
Figure 2: PPS and IP have different behavioral effects. Panel (a) shows the raw “evil” trait-expression for models trained without defense (e.g., without PPS or IP). The x-axis shows the “evilness” intensity of the training data. The y-axis shows the training data used to pre-finetune starting models to have varying baseline trait-expression. The None column (equivalent to the None row) shows these starting models’ …
Figure 3: PPS causes suppressive gradient pressure along the PPS vector direction. Panels show the effect of defensive objects on the gradients’ cosine similarity with “evil” PV and random vector directions at layer 20, computed on “evil” data. Each point corresponds to a single data sequence averaged over token positions. The x-axis shows the cosine similarity when PPS or IP is included in the forward pass; the y-a…
Figure 4: IP reduces the loss and PPS vectors can too, but PPS vectors can increase the loss at effective defensive intensities. Panels (a) and (b) show scatter plots where each point shows the next-token prediction loss on a single trait-expressing data sequence averaged over token positions, where PPS or IP is included in the forward pass on the x-axis and excluded on the y-axis. Points that are above the dashed l…
Figure 5: Gradient cosine similarity predicts cross-trait PPS effectiveness. Panels (a)–(d) show the cosine similarity between the gradient and different vectors at each model layer, before the PPS finetuning. Color denotes the steering intensity of the PPS vector applied in the forward pass, where black shows the defenseless baseline and dashed green denotes the default steering intensity of 1.5. The top panels (a,…
Figure 6: Raw activation projections onto “evil” PPS axis show that PPS-trained models reduce …
Figure 7: Results of …
Figure 8: More IP prompts lead to worse defense. The y-axis shows “evil” trait-expression for models trained with IP using different numbers of sampled prompts during training. The x-axis shows the number of prompts in the set from which prompts are sampled for each training sample, where prompts are freshly sampled for each training batch. Fewer prompts lead to lower trait-acquisition for the same training s…
Figure 9: The cosine similarity between the gradient and the evil persona vector across layers with …
Figure 10: The cosine similarity between the gradient and the evil persona vector across layers with …
Figure 11: Gradient cosine analysis along the “evil” PV axis over the course of training. The y-axis shows the cosine similarity between the gradient induced by Defenseless (base) Qwen2.5-7B-Instruct, PPS, and IP along the “evil” PV over the course of each method’s training in the respective panels.
Figure 12: Each panel shows the tokenwise cosine similarity between the gradient …
Figure 13: Each panel shows the cosine similarity between the …
Figure 14: Each panel shows the cosine similarity between the …
Figure 15: Loss curves corresponding to all 3 training epochs when manually manipulating the gra…
Figure 16: The cosine similarity between the gradients induced by IP compared to the gradients …
Figure 17: Cosine similarity (y-axis) measurements across model layers (x-axis) between the “evil” …
Figure 18: Uncentered PCA decomposition of the defended gradient difference reveals heavily concentrated, trait-aligned structure for PPS and more diffuse, less aligned structure for IP. Panels (a) and (b) show cosine similarity matrices between reference trait vectors (x-axis) and the top PCs (y-axis) extracted from PCA performed on the difference between the gradient with and without the defensive object in th…
Figure 19: Trait expression and coherence scores as evaluated by GPT 4.1 for finetunings with …
Figure 20: Trait expression and coherence scores as evaluated by GPT 4.1 for finetunings with …
Figure 21: Evil trait expression and coherence across a range of PPS steering intensities at layer 20 …
Figure 22: Evil trait expression and coherence across a range of model layers, using steering …
Original abstract

Defensive training methods such as positive preventative steering (PPS) and inoculation prompting (IP) offer surprising results through seemingly similar processes: both add trait-inducing objects to large language models (LLMs) during training, and both defend the LLM against acquiring the trait. The surprising success of these methods comes with the question: how do they work? Are PPS and IP doing the same thing? We provide behavioral and mechanistic comparisons of these two methods using "evilness" as a case-study trait. Our central finding is that PPS and IP achieve their defensive benefits through distinct mechanisms. Behaviorally, we show that neither PPS nor IP operates through a purely associative mechanism; and PPS can both defend against trait acquisition and actively reduce pre-existing expression, whereas IP is ineffective in models that were previously finetuned to express the trait. This behavioral divergence is reflected mechanistically: PPS shifts the activation gradient towards an attenuating direction along the PPS vector axis. When the PPS vector is aligned with a trait-expressing axis, it can reverse the gradient pressure, reducing rather than increasing activation along that axis. In contrast, IP continues to resist a precise mechanistic account. Direct cosine similarity analyses reveal that IP has a characteristically different gradient signature than PPS, and qualitative analyses reveal IP's gradient to be more diffuse. Furthermore, IP reduces the next-token prediction loss on trait-expressing data where PPS need not, consistent with the notion that IP "explains away" the trait-expression in the training data. Taken together, our analyses reveal distinct mechanisms by which each method operates and highlight open questions about IP's mechanistic picture.
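
The loss comparison in the abstract can be read as a paired, per-sequence test: average the next-token loss over token positions, once with the defensive object in the forward pass and once without, then ask how often the defended loss is lower. The sketch below shows that bookkeeping on made-up numbers. The function names and the data are hypothetical, and the paper's actual evaluation pipeline is not shown on this page.

```python
from statistics import mean


def sequence_losses(per_token_losses):
    """Average next-token loss over token positions, one value per sequence."""
    return [mean(tokens) for tokens in per_token_losses]


def fraction_with_lower_loss(defended, undefended):
    """Fraction of sequences whose mean loss is lower with the defense present."""
    if len(defended) != len(undefended):
        raise ValueError("need the same sequences in both conditions")
    defended_means = sequence_losses(defended)
    undefended_means = sequence_losses(undefended)
    lower = sum(d < u for d, u in zip(defended_means, undefended_means))
    return lower / len(defended_means)


# Made-up per-token losses for three sequences (not results from the paper).
with_defense = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
without_defense = [[2.0, 2.0], [0.5, 1.5], [1.0, 2.0]]

share_reduced = fraction_with_lower_loss(with_defense, without_defense)
print(f"share of sequences with lower loss under the defense: {share_reduced:.2f}")
```

A share near 1 would match the pattern reported for IP. A share that drops as steering intensity rises would match the statement that PPS need not reduce the loss.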

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that positive preventative steering (PPS) and inoculation prompting (IP) defend large language models against acquiring undesirable traits (using 'evilness' as the case study) via distinct mechanisms. Behaviorally, PPS both prevents trait acquisition and reduces pre-existing expression while IP does not; mechanistically, PPS produces an attenuating gradient shift along its vector axis, whereas IP yields a more diffuse gradient signature and reduces next-token loss on trait-expressing data.

Significance. If the distinct-mechanism claim holds beyond the reported setup, the work supplies useful mechanistic insight into two defensive training methods that currently succeed empirically but lack explanatory accounts. The combination of behavioral contrasts with gradient-direction analyses is a constructive step toward falsifiable understanding of how training interventions alter model internals.

major comments (3)
  1. [Abstract and behavioral/mechanistic comparisons] The central claim that PPS and IP 'achieve their defensive benefits through distinct mechanisms' rests entirely on experiments with the single trait 'evilness'. No additional traits, model scales, or ablation controls are reported, leaving open the possibility that the observed gradient signatures and behavioral divergences are idiosyncratic to how evilness is represented rather than general properties of the two methods.
  2. [Mechanistic analyses] The mechanistic distinction is described qualitatively (PPS produces an 'attenuating direction along the PPS vector axis'; IP is 'more diffuse') without accompanying quantitative values, error bars, or statistical tests on cosine similarities, gradient magnitudes, or loss differences. This absence makes it difficult to assess the reliability or effect size of the reported signatures.
  3. [Gradient and loss analyses] The statement that 'IP reduces the next-token prediction loss on trait-expressing data where PPS need not' is presented without the underlying data splits, exclusion criteria, or full loss curves, rendering the mechanistic interpretation (that IP 'explains away' trait expression) difficult to evaluate or replicate.
minor comments (2)
  1. [Abstract] The abstract refers to 'surprising results' without briefly indicating what baseline expectation is being violated.
  2. [Throughout] Ensure consistent first-use definitions for all acronyms (PPS, IP) and for the term 'PPS vector'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with clarifications on scope and planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and behavioral/mechanistic comparisons] The central claim that PPS and IP 'achieve their defensive benefits through distinct mechanisms' rests entirely on experiments with the single trait 'evilness'. No additional traits, model scales, or ablation controls are reported, leaving open the possibility that the observed gradient signatures and behavioral divergences are idiosyncratic to how evilness is represented rather than general properties of the two methods.

    Authors: We selected 'evilness' as a case study because it permits clear, measurable behavioral effects and interpretable gradient analyses. We acknowledge this single-trait design limits claims of broad generality. In revision we will add an explicit limitations paragraph discussing potential trait-specificity and outlining how the distinct-mechanism hypothesis could be tested on additional traits. We do not claim the signatures are universal; rather, they demonstrate that two empirically successful defenses need not share the same internal mechanism. revision: partial

  2. Referee: [Mechanistic analyses] The mechanistic distinction is described qualitatively (PPS produces an 'attenuating direction along the PPS vector axis'; IP is 'more diffuse') without accompanying quantitative values, error bars, or statistical tests on cosine similarities, gradient magnitudes, or loss differences. This absence makes it difficult to assess the reliability or effect size of the reported signatures.

    Authors: We agree that quantitative reporting is needed. The revised manuscript will include mean cosine similarities between gradient vectors and the PPS axis (with standard errors across three random seeds), gradient magnitude statistics, and two-sample t-tests comparing PPS versus IP signatures. These numbers will be added to the main results section and a new supplementary table. revision: yes

  3. Referee: [Gradient and loss analyses] The statement that 'IP reduces the next-token prediction loss on trait-expressing data where PPS need not' is presented without the underlying data splits, exclusion criteria, or full loss curves, rendering the mechanistic interpretation (that IP 'explains away' trait expression) difficult to evaluate or replicate.

    Authors: The loss comparison used a held-out set of 200 trait-expressing prompts never seen during defensive training. We will expand the methods section with exact split sizes, exclusion rules (prompts with >50% token overlap with training data were removed), and a supplementary figure showing per-epoch loss curves for both methods on the held-out set. This will make the 'explains away' interpretation directly verifiable. revision: yes
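
The quantitative reporting promised in the second response (means with standard errors across seeds, plus a two-sample t-test) could take roughly the following shape. This is a sketch under assumptions: the per-seed cosine values are invented for illustration, the function names are hypothetical, and Welch's unequal-variance form of the t-test is used here as one reasonable choice because the rebuttal does not say which variant is intended. A p-value would additionally need the t-distribution CDF, which the standard library does not provide, so it is left out.

```python
import math
from statistics import mean, stdev


def standard_error(values):
    """Standard error of the mean, using the sample standard deviation."""
    return stdev(values) / math.sqrt(len(values))


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Both samples need at least two values and non-zero variance overall.
    """
    var_a = stdev(a) ** 2 / len(a)
    var_b = stdev(b) ** 2 / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t_stat, dof


# Invented per-seed mean cosine similarities (three seeds each, not paper data).
pps_cosines = [-0.6, -0.5, -0.7]
ip_cosines = [-0.1, 0.0, 0.1]

t_stat, dof = welch_t(pps_cosines, ip_cosines)
print(f"PPS: {mean(pps_cosines):+.2f} +/- {standard_error(pps_cosines):.3f}")
print(f"IP:  {mean(ip_cosines):+.2f} +/- {standard_error(ip_cosines):.3f}")
print(f"Welch t = {t_stat:.2f}, dof = {dof:.1f}")
```

With only three seeds per method the degrees of freedom are small, so any such test would have limited power.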

Circularity Check

0 steps flagged

No circularity: claims rest on independent empirical gradient measurements and behavioral contrasts

full rationale

The paper derives its central claim of distinct mechanisms from direct behavioral experiments (PPS reduces pre-existing trait expression while IP does not) and mechanistic analyses (cosine similarity of gradients, qualitative diffuseness, next-token loss differences) performed on the evilness trait data. No equations or definitions reduce the reported gradient shifts or signatures to fitted parameters by construction, and no load-bearing self-citations or ansatzes are invoked to justify the distinctions. The derivation chain is self-contained against the reported measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on standard domain assumptions in LLM interpretability rather than new free parameters or invented entities.

axioms (1)
  • Domain assumption: evilness serves as a representative trait for studying general defensive mechanisms against unwanted LLM behaviors.
    The abstract uses evilness as the sole case-study trait without justifying its generality.


