Probe-Based Data Attribution: Discovering and Mitigating Undesirable Behaviors in LLM Post-Training
Pith reviewed 2026-05-16 02:41 UTC · model grok-4.3
The pith
Activation-difference vectors ranked by cosine similarity can identify the specific training datapoints responsible for undesirable behaviors in post-trained LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By computing activation-difference vectors for both test prompts and preference pairs and ranking by cosine similarity, the method identifies datapoints that cause specific behaviors and validates these attributions causally by retraining with modified data. Clustering behavior-datapoint similarity matrices also enables unsupervised discovery of emergent behaviors. Filtering top-ranked datapoints reduces this behavior by 63% while switching their labels achieves 78%.
What carries the argument
Activation-difference vectors ranked by cosine similarity to link test-prompt behaviors with responsible preference pairs.
If this is right
- Undesirable behaviors can be mitigated by targeted filtering or label correction of a small set of training datapoints.
- Clustering similarity matrices allows unsupervised discovery of emergent behaviors without predefined test prompts.
- The method outperforms gradient-based attribution and LLM-judge baselines while costing over ten times less.
- In-the-wild model organisms from contaminated preference data provide realistic benchmarks for safety techniques.
Where Pith is reading between the lines
- The same vector-comparison approach could be tested on capabilities rather than only safety failures to map which data drives specific skills.
- If the attribution holds across runs, it points toward systematic data-audit pipelines that clean preference sets before large-scale post-training.
- Distractor-triggered compliance suggests that minor prompt formatting elements can interact with noisy preference data to create brittle safety failures.
- Extending the probes to other post-training stages such as RLHF or continued pretraining would test whether the same mechanism applies beyond DPO.
Load-bearing premise
That cosine similarity ranking of activation-difference vectors selects datapoints that are causally responsible for the target behavior rather than merely correlated with it.
What would settle it
Retraining the model after removing or relabeling the top-ranked datapoints and observing no reduction in the target behavior would falsify the claim of causal attribution.
read the original abstract
We propose probe-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for both test prompts and preference pairs and ranking by cosine similarity, we identify datapoints that cause specific behaviors and validate these attributions causally by retraining with modified data. Clustering behavior-datapoint similarity matrices also enables unsupervised discovery of emergent behaviors. Applying this to OLMo 2's production DPO training, we surfaced distractor-triggered compliance: a harmful behavior where the model complies with dangerous requests when benign formatting instructions are appended. Filtering top-ranked datapoints reduces this behavior by 63% while switching their labels achieves 78%. Our method outperforms gradient-based attribution and LLM-judge baselines while being over 10 times cheaper than both. This in-the-wild model organism - emerging from contaminated preference data rather than deliberate injection - provides a realistic benchmark for safety techniques.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes probe-based data attribution to trace undesirable behaviors in post-trained LLMs back to specific training datapoints. Activation-difference vectors are computed for test prompts and preference pairs, ranked via cosine similarity to identify responsible data, and validated causally through retraining after filtering top-ranked pairs (63% reduction) or switching their labels (78% reduction) on a distractor-triggered compliance behavior in OLMo 2 DPO training. The approach also uses clustering for unsupervised behavior discovery, outperforms gradient-based and LLM-judge baselines, and is over 10x cheaper.
Significance. If the attribution method can be shown to identify causally responsible datapoints rather than merely correlated ones, this would offer a scalable and efficient alternative for data attribution in LLM post-training. The efficiency gains, unsupervised clustering capability, and use of an in-the-wild contaminated dataset as a model organism could meaningfully advance practical safety techniques for mitigating emergent harmful behaviors.
major comments (2)
- [Abstract and validation experiments] The causal validation (Abstract and validation experiments) reports 63% and 78% reductions after filtering or relabeling top-ranked datapoints but provides no results for equal-sized random subsets or bottom-ranked pairs. Without these controls, the reductions cannot be attributed specifically to the cosine-similarity ranking rather than a generic effect of removing any high-influence preference data, weakening the central claim that the method discovers causally responsible datapoints.
- [Experimental setup and validation] No details are given on the number of retraining runs, statistical significance of the reported reductions, or the exact protocols for data modification and retraining (e.g., how many pairs are filtered, training hyperparameters). These omissions make it impossible to assess whether the quantitative improvements are robust.
minor comments (1)
- [Abstract] The claim that the method is 'over 10 times cheaper' should specify the precise metrics (e.g., GPU-hours, FLOPs, or wall-clock time) and the exact baselines used for the comparison.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We appreciate the emphasis on strengthening the causal validation and providing fuller experimental details. We address each major comment below and will revise the manuscript to incorporate the suggested controls and clarifications.
read point-by-point responses
-
Referee: [Abstract and validation experiments] The causal validation (Abstract and validation experiments) reports 63% and 78% reductions after filtering or relabeling top-ranked datapoints but provides no results for equal-sized random subsets or bottom-ranked pairs. Without these controls, the reductions cannot be attributed specifically to the cosine-similarity ranking rather than a generic effect of removing any high-influence preference data, weakening the central claim that the method discovers causally responsible datapoints.
Authors: We agree that the absence of random-subset and bottom-ranked controls limits the strength of the causal attribution claim. In the revised version we will add these experiments: (i) retraining after removing an equal number of randomly selected preference pairs, and (ii) retraining after removing the bottom-ranked pairs according to our cosine-similarity metric. We expect the random and bottom-ranked conditions to produce substantially smaller (or even opposite) effects on the distractor-triggered compliance behavior, thereby confirming that the observed reductions are specific to the top-ranked attributions. revision: yes
-
Referee: [Experimental setup and validation] No details are given on the number of retraining runs, statistical significance of the reported reductions, or the exact protocols for data modification and retraining (e.g., how many pairs are filtered, training hyperparameters). These omissions make it impossible to assess whether the quantitative improvements are robust.
Authors: We will expand the experimental section to report: the number of independent retraining runs performed (three runs with different random seeds), standard deviations across runs, and statistical significance (paired t-tests or Wilcoxon tests) for the 63% and 78% reductions. We will also specify the exact number of preference pairs filtered or relabeled in each condition, the precise data-modification procedure, and all retraining hyperparameters (learning rate, batch size, number of epochs, etc.). These details will be added to both the main text and the appendix. revision: yes
Circularity Check
No circularity in derivation chain
full rationale
The paper computes activation-difference vectors for test prompts and preference pairs, ranks them by cosine similarity to attribute behaviors, and validates via independent retraining experiments after filtering or label-switching the top-ranked datapoints. No equations, self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations reduce the attribution scores or causal claims to the inputs by construction. The validation reductions (63% and 78%) are measured against external benchmarks of model behavior change, keeping the central method self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Activation-difference vectors ranked by cosine similarity identify causally responsible training datapoints for observed behaviors.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.