Probe-Based Data Attribution: Discovering and Mitigating Undesirable Behaviors in LLM Post-Training

Frank Xiao; Santiago Aranguri

arxiv: 2602.11079 · v3 · submitted 2026-02-11 · 💻 cs.LG · cs.AI

Probe-Based Data Attribution: Discovering and Mitigating Undesirable Behaviors in LLM Post-Training

Frank Xiao , Santiago Aranguri This is my paper

Pith reviewed 2026-05-16 02:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords probe-based data attributionLLM post-trainingactivation differencesDPOundesirable behaviorspreference datamodel safetycosine similarity

0 comments

The pith

Activation-difference vectors ranked by cosine similarity can identify the specific training datapoints responsible for undesirable behaviors in post-trained LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops probe-based data attribution to trace behavioral changes in post-trained language models back to individual training datapoints. It computes activation-difference vectors for test prompts and preference pairs, then ranks the pairs by cosine similarity to surface the most influential examples. When applied to OLMo 2's DPO training, the method revealed distractor-triggered compliance, a behavior in which the model follows dangerous requests once benign formatting instructions are added. Removing the top-ranked datapoints reduced this behavior by 63 percent while switching their labels reduced it by 78 percent. The technique also supports unsupervised discovery of emergent behaviors through clustering and runs over ten times cheaper than gradient or LLM-judge baselines.

Core claim

By computing activation-difference vectors for both test prompts and preference pairs and ranking by cosine similarity, the method identifies datapoints that cause specific behaviors and validates these attributions causally by retraining with modified data. Clustering behavior-datapoint similarity matrices also enables unsupervised discovery of emergent behaviors. Filtering top-ranked datapoints reduces this behavior by 63% while switching their labels achieves 78%.

What carries the argument

Activation-difference vectors ranked by cosine similarity to link test-prompt behaviors with responsible preference pairs.

If this is right

Undesirable behaviors can be mitigated by targeted filtering or label correction of a small set of training datapoints.
Clustering similarity matrices allows unsupervised discovery of emergent behaviors without predefined test prompts.
The method outperforms gradient-based attribution and LLM-judge baselines while costing over ten times less.
In-the-wild model organisms from contaminated preference data provide realistic benchmarks for safety techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same vector-comparison approach could be tested on capabilities rather than only safety failures to map which data drives specific skills.
If the attribution holds across runs, it points toward systematic data-audit pipelines that clean preference sets before large-scale post-training.
Distractor-triggered compliance suggests that minor prompt formatting elements can interact with noisy preference data to create brittle safety failures.
Extending the probes to other post-training stages such as RLHF or continued pretraining would test whether the same mechanism applies beyond DPO.

Load-bearing premise

That cosine similarity ranking of activation-difference vectors selects datapoints that are causally responsible for the target behavior rather than merely correlated with it.

What would settle it

Retraining the model after removing or relabeling the top-ranked datapoints and observing no reduction in the target behavior would falsify the claim of causal attribution.

read the original abstract

We propose probe-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for both test prompts and preference pairs and ranking by cosine similarity, we identify datapoints that cause specific behaviors and validate these attributions causally by retraining with modified data. Clustering behavior-datapoint similarity matrices also enables unsupervised discovery of emergent behaviors. Applying this to OLMo 2's production DPO training, we surfaced distractor-triggered compliance: a harmful behavior where the model complies with dangerous requests when benign formatting instructions are appended. Filtering top-ranked datapoints reduces this behavior by 63% while switching their labels achieves 78%. Our method outperforms gradient-based attribution and LLM-judge baselines while being over 10 times cheaper than both. This in-the-wild model organism - emerging from contaminated preference data rather than deliberate injection - provides a realistic benchmark for safety techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper shows a practical probe method for attributing post-training behaviors to specific datapoints with real reductions on OLMo 2, though the validation lacks key controls to confirm specificity.

read the letter

The main takeaway is that activation-difference vectors ranked by cosine similarity can trace specific unwanted behaviors in post-trained LLMs back to individual preference pairs, and the authors demonstrate this on OLMo 2 DPO data by surfacing distractor-triggered compliance and cutting it 63-78% via targeted filtering or label flips. The approach is cheap and scales better than gradients or LLM judges. What works is the combination itself: activation diffs applied to post-training, plus the unsupervised clustering step that finds behaviors without needing labeled test cases upfront. Applying it to real contaminated production data rather than synthetic injections gives it a grounded feel, and the retraining experiments provide at least some causal check. The soft spots sit in the validation details. The reductions look good, but without results for equal-sized random subsets or bottom-ranked pairs, it is hard to tell whether the cosine ranking is doing the specific attribution work or if any removal of high-influence data would produce similar drops. The abstract also skips run counts, variance, and exact edit protocols, which leaves the causal claim thinner than it needs to be. This is aimed at LLM safety and alignment researchers who audit or clean preference data at scale. A reader working on post-training debugging would get a usable pipeline to try. It deserves peer review because the core technique is a straightforward new combination with a real-world case, even if the experiments need tighter controls to pin down the attribution strength.

Referee Report

2 major / 1 minor

Summary. The paper proposes probe-based data attribution to trace undesirable behaviors in post-trained LLMs back to specific training datapoints. Activation-difference vectors are computed for test prompts and preference pairs, ranked via cosine similarity to identify responsible data, and validated causally through retraining after filtering top-ranked pairs (63% reduction) or switching their labels (78% reduction) on a distractor-triggered compliance behavior in OLMo 2 DPO training. The approach also uses clustering for unsupervised behavior discovery, outperforms gradient-based and LLM-judge baselines, and is over 10x cheaper.

Significance. If the attribution method can be shown to identify causally responsible datapoints rather than merely correlated ones, this would offer a scalable and efficient alternative for data attribution in LLM post-training. The efficiency gains, unsupervised clustering capability, and use of an in-the-wild contaminated dataset as a model organism could meaningfully advance practical safety techniques for mitigating emergent harmful behaviors.

major comments (2)

[Abstract and validation experiments] The causal validation (Abstract and validation experiments) reports 63% and 78% reductions after filtering or relabeling top-ranked datapoints but provides no results for equal-sized random subsets or bottom-ranked pairs. Without these controls, the reductions cannot be attributed specifically to the cosine-similarity ranking rather than a generic effect of removing any high-influence preference data, weakening the central claim that the method discovers causally responsible datapoints.
[Experimental setup and validation] No details are given on the number of retraining runs, statistical significance of the reported reductions, or the exact protocols for data modification and retraining (e.g., how many pairs are filtered, training hyperparameters). These omissions make it impossible to assess whether the quantitative improvements are robust.

minor comments (1)

[Abstract] The claim that the method is 'over 10 times cheaper' should specify the precise metrics (e.g., GPU-hours, FLOPs, or wall-clock time) and the exact baselines used for the comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We appreciate the emphasis on strengthening the causal validation and providing fuller experimental details. We address each major comment below and will revise the manuscript to incorporate the suggested controls and clarifications.

read point-by-point responses

Referee: [Abstract and validation experiments] The causal validation (Abstract and validation experiments) reports 63% and 78% reductions after filtering or relabeling top-ranked datapoints but provides no results for equal-sized random subsets or bottom-ranked pairs. Without these controls, the reductions cannot be attributed specifically to the cosine-similarity ranking rather than a generic effect of removing any high-influence preference data, weakening the central claim that the method discovers causally responsible datapoints.

Authors: We agree that the absence of random-subset and bottom-ranked controls limits the strength of the causal attribution claim. In the revised version we will add these experiments: (i) retraining after removing an equal number of randomly selected preference pairs, and (ii) retraining after removing the bottom-ranked pairs according to our cosine-similarity metric. We expect the random and bottom-ranked conditions to produce substantially smaller (or even opposite) effects on the distractor-triggered compliance behavior, thereby confirming that the observed reductions are specific to the top-ranked attributions. revision: yes
Referee: [Experimental setup and validation] No details are given on the number of retraining runs, statistical significance of the reported reductions, or the exact protocols for data modification and retraining (e.g., how many pairs are filtered, training hyperparameters). These omissions make it impossible to assess whether the quantitative improvements are robust.

Authors: We will expand the experimental section to report: the number of independent retraining runs performed (three runs with different random seeds), standard deviations across runs, and statistical significance (paired t-tests or Wilcoxon tests) for the 63% and 78% reductions. We will also specify the exact number of preference pairs filtered or relabeled in each condition, the precise data-modification procedure, and all retraining hyperparameters (learning rate, batch size, number of epochs, etc.). These details will be added to both the main text and the appendix. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper computes activation-difference vectors for test prompts and preference pairs, ranks them by cosine similarity to attribute behaviors, and validates via independent retraining experiments after filtering or label-switching the top-ranked datapoints. No equations, self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations reduce the attribution scores or causal claims to the inputs by construction. The validation reductions (63% and 78%) are measured against external benchmarks of model behavior change, keeping the central method self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unstated assumption that activation differences provide a causal signal for data attribution.

axioms (1)

domain assumption Activation-difference vectors ranked by cosine similarity identify causally responsible training datapoints for observed behaviors.
Invoked implicitly in the method description and causal validation claim.

pith-pipeline@v0.9.0 · 5457 in / 1384 out tokens · 42327 ms · 2026-05-16T02:41:09.700636+00:00 · methodology

Probe-Based Data Attribution: Discovering and Mitigating Undesirable Behaviors in LLM Post-Training

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)