Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions

Edward Grefenstette; Jakob Foerster; J Rosser; Laura Ruis; Robert Kirk

arXiv: 2602.09987 · v5 · submitted 2026-02-10 · 💻 cs.LG · cs.AI · cs.CY

Pith reviewed 2026-05-16 02:15 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CY

keywords influence functions · data poisoning · training data editing · model behavior control · machine learning · CIFAR-10 · adversarial editing

The pith

Subtle edits to 0.2% of training data can steer model behavior competitively with explicit examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Infusion, a method that uses influence functions to make small changes to a tiny portion of training data in order to induce specific behaviors in the trained model. On the CIFAR-10 image dataset, altering just 100 of the 45,000 training examples this way can be competitive with the baseline of inserting a small number of explicit examples of the desired behavior. The approach transfers across neural network architectures, so one edited dataset can affect multiple independently trained models. For language tasks, it succeeds best when amplifying behaviors the model has already learned rather than introducing entirely new ones. This demonstrates that training data can be edited subtly to shape AI outputs, highlighting risks and opportunities in data curation.

Core claim

Infusion computes small perturbations to training documents using scalable influence-function approximations, such that when the model is retrained on the modified corpus, it exhibits targeted changes in behavior that are competitive with direct insertion of behavior examples, as shown on vision tasks and partially on language tasks.

What carries the argument

The Infusion framework, which applies scalable approximations of influence functions to calculate perturbations to training data that induce desired parameter shifts in the retrained model.
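Written out, the mechanism is a two-step linearization. The sketch below uses standard influence-function notation: $\hat\theta$ is the trained parameter vector, $H_{\hat\theta}$ the Hessian of the training objective at $\hat\theta$, $L$ the per-example loss, $f$ the measure of the target behavior, $\delta$ the perturbation applied to a training point $z$, and $n$ the number of training examples. The first and last lines follow the expressions the paper uses; the middle parameter-shift line is filled in here as the usual first-order step and is an assumption about how the two connect, not a quotation.

```latex
\begin{align}
\mathcal{I}_{\text{up,params}}(z)
  &= -H_{\hat\theta}^{-1}\,\nabla_\theta L(z,\hat\theta)
  && \text{(effect of upweighting } z\text{)} \\
\Delta\theta
  &\approx -\frac{1}{n}\,H_{\hat\theta}^{-1}\,
     \bigl[\nabla_z \nabla_\theta L(z,\hat\theta)\bigr]\,\delta
  && \text{(effect of editing } z \to z+\delta\text{)} \\
\Delta f(\hat\theta)
  &\approx \nabla_\theta f(\hat\theta)^\top \Delta\theta
   = -\frac{1}{n}\Bigl(\nabla_\theta f(\hat\theta)^\top H_{\hat\theta}^{-1}
     \bigl[\nabla_z \nabla_\theta L(z,\hat\theta)\bigr]\Bigr)\,\delta
\end{align}
```

Because the last line is linear in $\delta$, choosing the edit reduces to a constrained optimization over $\delta$, which the paper solves with PGD.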

If this is right

Small edits to 0.2% of the CIFAR-10 training set achieve competitive performance with explicit behavior insertion baselines.
The poisoned corpus transfers across architectures such as ResNet and CNN, affecting independently trained models.
In language experiments, the method increases the probability of target behaviors especially when amplifying already learned patterns.
Small, subtle changes to training data can systematically shape model behavior without obvious insertions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the approximations hold, data poisoning attacks become feasible with minimal and hard-to-detect changes.
Defenders may need new tools to audit training data for influence-based manipulations.
Extending this to larger models could reveal whether the method scales beyond the tested vision and preliminary language cases.

Load-bearing premise

The influence function approximations remain accurate enough to predict the actual effects after retraining the model from scratch on the perturbed dataset.
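On a convex toy problem this premise can be checked exactly, because retraining has a closed form. The sketch below is not the paper's code: it assumes a ridge-regression model, a target quantity `f` equal to the prediction at one hypothetical test point, and an edit to a single training input. All names (`fit_ridge`, `influence_coefficients`, `realized_delta_f`) are illustrative. It compares the first-order influence prediction of the change in `f` with the change actually obtained by refitting on the edited data.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form minimizer of (1/n) * sum 0.5*(x.theta - y)^2 + (lam/2)*|theta|^2."""
    n, d = X.shape
    H = X.T @ X / n + lam * np.eye(d)      # Hessian of the training objective
    theta = np.linalg.solve(H, X.T @ y / n)
    return theta, H

def influence_coefficients(X, y, lam, i, x_test):
    """Vector c such that the change in f is approximately c @ delta for an edit delta to X[i]."""
    n, d = X.shape
    theta, H = fit_ridge(X, y, lam)
    residual = X[i] @ theta - y[i]
    # Mixed derivative of the squared loss: d/dx of grad_theta L = residual*I + x theta^T
    J = residual * np.eye(d) + np.outer(X[i], theta)
    # f(theta) = x_test @ theta, so grad_theta f = x_test
    return -(J.T @ np.linalg.solve(H, x_test)) / n

def realized_delta_f(X, y, lam, i, x_test, delta):
    """Change in f after actually refitting on the edited training set."""
    theta, _ = fit_ridge(X, y, lam)
    X_edit = X.copy()
    X_edit[i] = X_edit[i] + delta
    theta_edit, _ = fit_ridge(X_edit, y, lam)
    return x_test @ (theta_edit - theta)

# Synthetic data (assumption: Gaussian inputs, linear targets with small noise)
rng = np.random.default_rng(0)
n, d, lam, eps = 200, 5, 1e-2, 1e-4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
x_test = rng.normal(size=d)

c = influence_coefficients(X, y, lam, 0, x_test)
# For a linear objective under an L-infinity budget eps, the maximizer is eps*sign(c);
# an iterative solver such as PGD is needed only when the objective is not linear.
delta = eps * np.sign(c)
predicted = c @ delta
realized = realized_delta_f(X, y, lam, 0, x_test, delta)
print(f"predicted {predicted:.3e}  realized {realized:.3e}")
```

In this convex, single-edit setting the two numbers agree closely for small `eps`. The open question for the paper is whether comparable agreement survives non-convex training, many simultaneous edits, and retraining from scratch.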

What would settle it

A direct experiment retraining models on the Infusion-modified datasets and finding no statistically significant shift in behavior compared to unmodified or randomly edited data would disprove the central effectiveness claim.
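One way to make "statistically significant" concrete is a permutation test over per-run scores. The following is a minimal sketch under stated assumptions: each retraining run yields one scalar score for the target behavior, and the two groups are runs on Infusion-edited data and runs on a control corpus (unmodified or randomly edited). The function name and interface are hypothetical and do not come from the paper or its repository.

```python
import numpy as np

def permutation_p_value(scores_a, scores_b, num_perm=10000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(num_perm):
        perm = rng.permutation(pooled)          # random relabeling of the runs
        diff = abs(perm[:a.size].mean() - perm[a.size:].mean())
        count += diff >= observed
    # Add-one correction keeps the estimate strictly positive
    return (count + 1) / (num_perm + 1)
```

A large p-value across enough independent seeds would count against the effectiveness claim; a small one would support it only to the extent that the control corpus is a fair comparison.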

Original abstract

Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR-10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet $\leftrightarrow$ CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: https://github.com/jrosseruk/infusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Infusion reverses influence functions to craft small training-data edits that achieve competitive poisoning on CIFAR-10 with 0.2% changes and some cross-architecture transfer, though the approximation's reliability after full retraining remains unverified.

The main takeaway is that editing just 100 out of 45,000 CIFAR-10 training points with influence-guided perturbations can match the effect of inserting explicit trigger examples, and the poisoned set transfers to a different architecture. That is the concrete result worth noting first.

The paper takes the standard influence-function machinery, which normally attributes behavior to data points, and flips it to generate the perturbations that would push the model toward a target behavior. They implement scalable approximations and test the idea on vision poisoning plus some language cases. The vision numbers look competitive, the cross-architecture transfer is a useful data point for anyone thinking about shared training corpora, and they release code, which makes the claims checkable. The language experiments stay qualitative and show the method mainly amplifies behaviors the model already knows, which is consistent with how influence functions behave but limits the scope.

The soft spot is exactly the one the stress test flags: influence functions are local linear approximations around the original optimum. Once you change 100 points and retrain from scratch, the actual parameter shift can diverge from the predicted one because of higher-order effects and the full optimization path. The paper does not report how closely the realized delta matches the influence prediction, so it is hard to tell whether success comes from the method or from the edits simply being useful perturbations. The central claim therefore rests on the empirical outcome rather than on a verified link between approximation and final model.

This work is aimed at people studying data poisoning, training-data governance, or interpretability tools. A reader who already knows influence functions will get the most out of it and can judge the approximation gap themselves. The experiments are concrete enough and the code is public, so the paper deserves a serious referee rather than a desk reject. I would send it out for review with the expectation that referees will press on the approximation quality and the language results.

Referee Report

2 major / 2 minor

Summary. The paper introduces Infusion, a framework that uses scalable influence-function approximations to compute small perturbations to a tiny fraction (0.2%, or 100 out of 45,000) of training documents on CIFAR-10, inducing targeted model behavior changes upon retraining. It reports that these edits are competitive with baselines that insert explicit behavior examples, demonstrates cross-architecture transfer in both directions between ResNet and CNN, and provides preliminary language-model results showing effectiveness mainly at amplifying already-learned behaviors.

Significance. If the influence approximations remain accurate under simultaneous edits and full retraining, the work would establish a practical, low-footprint method for steering model behavior via training-data edits, with direct implications for both adversarial robustness and data-interpretability defenses. The reported cross-architecture transfer and competitive performance at 0.2% edit rate would be notable strengths.

major comments (2)

[§4 (CIFAR-10 experiments)] The central empirical claim (abstract and §4) that Infusion edits to 100 points are competitive with explicit poisoning baselines rests on the assumption that the influence-function-guided perturbations produce the intended parameter shifts after full retraining from scratch. The manuscript provides no direct comparison between the influence-predicted parameter delta and the realized delta after SGD on the edited corpus, leaving open whether observed success stems from the approximation or from the edits being useful irrespective of the influence calculation.
[§3.2] §3.2 (scalable approximation): the description of the influence-function implementation does not quantify or bound the approximation error when 100 simultaneous edits are applied, nor does it report how the chosen Hessian-inverse estimator (e.g., LiSSA, conjugate-gradient, or sampling) behaves far from the original optimum after the full retraining trajectory. This is load-bearing for the claim that the method systematically shapes behavior via influence-guided edits.
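For readers unfamiliar with the estimators named in the second comment, the sketch below shows the Neumann-series recursion that LiSSA is built on, in a deterministic form. It is an illustration under stated assumptions, not the paper's implementation: LiSSA proper replaces the exact Hessian-vector product with minibatch estimates and averages several runs, while here `hvp` is exact and the function name `neumann_inverse_hvp` is hypothetical. The recursion converges only when `scale` is at least the largest eigenvalue of a positive-definite `H`.

```python
import numpy as np

def neumann_inverse_hvp(hvp, v, scale, num_steps):
    """Approximate H^{-1} v given a function hvp(u) = H @ u.

    Iterates h_j = v + h_{j-1} - hvp(h_{j-1}) / scale, whose fixed point
    satisfies hvp(h) / scale = v, so h / scale approximates H^{-1} v.
    """
    h = v.copy()
    for _ in range(num_steps):
        h = v + h - hvp(h) / scale
    return h / scale

# Small positive-definite example (assumption: damped Gauss-Newton-like matrix)
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
H = A.T @ A / 50 + 0.1 * np.eye(5)
v = rng.normal(size=5)

scale = 1.1 * np.linalg.eigvalsh(H).max()   # ensures the series converges
approx = neumann_inverse_hvp(lambda u: H @ u, v, scale, num_steps=2000)
exact = np.linalg.solve(H, v)
print(np.max(np.abs(approx - exact)))
```

The convergence condition is tied to the curvature at the point where `H` is evaluated, which is why the referee asks how the estimate behaves once retraining has moved the parameters away from the original optimum.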

minor comments (2)

[Tables/Figures] Table 1 and Figure 2: axis labels and legend entries should explicitly state whether the reported accuracies are on the clean test set or on a poisoned evaluation set.
[§5] The language-model section (§5) is labeled 'preliminary'; the authors should clarify the exact training regime (full fine-tuning vs. LoRA) and the number of runs used to compute the reported probability shifts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The concerns about validating the influence approximations under full retraining are well-taken, and we will strengthen the manuscript with additional analysis and experiments to address them directly.

Referee: [§4 (CIFAR-10 experiments)] The central empirical claim (abstract and §4) that Infusion edits to 100 points are competitive with explicit poisoning baselines rests on the assumption that the influence-function-guided perturbations produce the intended parameter shifts after full retraining from scratch. The manuscript provides no direct comparison between the influence-predicted parameter delta and the realized delta after SGD on the edited corpus, leaving open whether observed success stems from the approximation or from the edits being useful irrespective of the influence calculation.

Authors: We agree that a direct comparison would provide stronger evidence that the observed effects arise from the influence-guided perturbations rather than incidental properties of the edits. In the revised manuscript we will add this analysis to §4: we will compute the influence-predicted parameter delta for the 100 edited points and contrast it with the realized parameter shift after full SGD retraining from scratch, reporting metrics such as cosine similarity and relative norm difference. This will clarify the contribution of the approximation while preserving the competitive performance results.
revision: yes

Referee: [§3.2] §3.2 (scalable approximation): the description of the influence-function implementation does not quantify or bound the approximation error when 100 simultaneous edits are applied, nor does it report how the chosen Hessian-inverse estimator (e.g., LiSSA, conjugate-gradient, or sampling) behaves far from the original optimum after the full retraining trajectory. This is load-bearing for the claim that the method systematically shapes behavior via influence-guided edits.

Authors: We acknowledge that explicit error quantification for the multi-edit case and behavior of the Hessian-inverse estimator away from the original optimum would strengthen the methodological claims. In the revision we will (i) specify that LiSSA is the estimator used, (ii) add an empirical error analysis comparing approximate influence scores against exact leave-one-out retraining on a held-out subset of edits, and (iii) include a short discussion of estimator stability along the retraining trajectory. These additions will be placed in §3.2 and the experimental appendix.
revision: yes
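The two metrics named in the first response, cosine similarity and relative norm difference between predicted and realized parameter shifts, could be computed as follows. This is a hedged sketch: the rebuttal does not define "relative norm difference", so the definition used here (norm of the difference divided by the norm of the realized shift) is an assumption, and `delta_agreement` is a hypothetical helper name.

```python
import numpy as np

def delta_agreement(predicted, realized):
    """Return (cosine similarity, relative norm difference) of two parameter shifts.

    Both inputs are flattened, so they may be arrays of any matching shape.
    Relative norm difference is taken as |predicted - realized| / |realized|
    (an assumed definition; other normalizations are possible).
    """
    p = np.ravel(np.asarray(predicted, dtype=float))
    r = np.ravel(np.asarray(realized, dtype=float))
    cosine = (p @ r) / (np.linalg.norm(p) * np.linalg.norm(r))
    relative = np.linalg.norm(p - r) / np.linalg.norm(r)
    return cosine, relative
```

A cosine near 1 with a small relative difference would indicate that the influence prediction tracks the retrained model; a cosine near 0 alongside strong poisoning results would suggest the edits work for reasons the approximation does not capture.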

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of influence-based edits via actual retraining

The paper's central claims rest on experimental results: edits computed via influence approximations are applied to 0.2% of training data, the model is retrained from scratch, and performance is measured on held-out test sets. No derivation chain reduces a prediction to its own inputs by construction, no fitted parameter is relabeled as a prediction, and no load-bearing step depends on self-citation of an unverified uniqueness result. The influence-function approximations are used only to construct the edits; success is validated externally by the retraining experiments themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the validity of first-order influence approximations for finite perturbations and on the assumption that the target behavior can be expressed as a parameter shift reachable by small data edits. No new mathematical axioms are introduced beyond standard influence-function theory.

pith-pipeline@v0.9.0 · 5520 in / 1089 out tokens · 86505 ms · 2026-05-16T02:15:17.132915+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We solve for $\delta$ via PGD: ... $\Delta f(\hat\theta) \approx -\frac{1}{n} \left( \nabla_\theta f(\hat\theta)^\top H_{\hat\theta}^{-1} \left[ \nabla_z \nabla_\theta L(z, \hat\theta) \right] \right) \delta$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: influence functions ... $\mathcal{I}_{\text{up,params}}(z) = -H_{\hat\theta}^{-1} \nabla_\theta L(z, \hat\theta)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.