arxiv: 2602.09987 · v5 · submitted 2026-02-10 · 💻 cs.LG · cs.AI · cs.CY

Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions

Pith reviewed 2026-05-16 02:15 UTC · model grok-4.3

keywords influence functions · data poisoning · training data editing · model behavior control · machine learning · CIFAR-10 · adversarial editing

The pith

Subtle edits to 0.2% of training data can steer model behavior competitively with explicit examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Infusion, a method that uses influence functions to make small changes to a tiny portion of the training data in order to induce specific behaviors in the retrained model. On CIFAR-10, subtly editing just 100 of 45,000 training examples (0.2%) this way is competitive with the baseline of inserting a small number of explicit examples of the desired behavior. The edits transfer between architectures (ResNet and CNN), which suggests that one edited dataset can affect multiple independently trained models. In preliminary language experiments, the method works best when amplifying behaviors the model has already learned rather than introducing entirely new ones. Taken together, the results indicate that small, subtle edits to training data can systematically shape model behavior, which matters to adversaries and defenders alike.

Core claim

Infusion computes small perturbations to training documents using scalable influence-function approximations, such that when the model is retrained on the modified corpus, it exhibits targeted changes in behavior that are competitive with direct insertion of behavior examples, as shown on vision tasks and partially on language tasks.

What carries the argument

The Infusion framework, which applies scalable approximations of influence functions to calculate perturbations to training data that induce desired parameter shifts in the retrained model.
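For orientation, the standard first-order influence-function approximation that a framework of this kind builds on can be written as follows. This is a sketch under textbook assumptions (a twice-differentiable, strictly convex empirical risk evaluated at its minimizer). The notation is assumed here for illustration, and the paper's exact objective and scalable approximations may differ.

```latex
% Setup (notation assumed here, not taken from the paper):
% training points z_1..z_n, per-example loss L(z, \theta), measurement f(\theta).
\hat\theta = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} L(z_i,\theta),
\qquad
H = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i,\hat\theta)

% Replacing one training point z by z_\delta = z + \delta:
\hat\theta_{z \to z_\delta} - \hat\theta
\;\approx\;
-\frac{1}{n}\, H^{-1}\bigl[\nabla_\theta L(z_\delta,\hat\theta) - \nabla_\theta L(z,\hat\theta)\bigr]
\;\approx\;
-\frac{1}{n}\, H^{-1}\,\bigl[\nabla_z \nabla_\theta L(z,\hat\theta)\bigr]\,\delta

% Resulting first-order change in the measurement:
f(\hat\theta_{z \to z_\delta}) - f(\hat\theta)
\;\approx\;
\nabla_\theta f(\hat\theta)^{\top}\bigl(\hat\theta_{z \to z_\delta} - \hat\theta\bigr)
```

Read together, the last two expressions are linear in $\delta$, so a small perturbation can in principle be chosen to move the measurement $f$ in a desired direction. Whether that prediction survives retraining is a separate empirical question.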

If this is right

  • Small edits to 0.2% of the CIFAR-10 training set achieve competitive performance with explicit behavior insertion baselines.
  • The poisoned corpus transfers across architectures such as ResNet and CNN, affecting independently trained models.
  • In language experiments, the method increases the probability of target behaviors especially when amplifying already learned patterns.
  • Small, subtle changes to training data can systematically shape model behavior without obvious insertions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the approximations hold, data poisoning attacks become feasible with minimal and hard-to-detect changes.
  • Defenders may need new tools to audit training data for influence-based manipulations.
  • Extending this to larger models could reveal whether the method scales beyond the tested vision and preliminary language cases.

Load-bearing premise

The influence function approximations remain accurate enough to predict the actual effects after retraining the model from scratch on the perturbed dataset.
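The premise can be checked directly in a setting where exact retraining is cheap. The sketch below uses ridge regression as a stand-in (an assumption made purely for illustration; the paper works with neural networks), perturbs a few training inputs, and compares the influence-predicted parameter shift with the shift obtained by refitting. All function names and default values are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Exact minimizer of (1/n) * sum 0.5*(x_i @ theta - y_i)**2 + 0.5*lam*||theta||^2."""
    n, d = X.shape
    H = X.T @ X / n + lam * np.eye(d)      # Hessian of the objective
    theta = np.linalg.solve(H, X.T @ y / n)
    return theta, H

def per_example_grad(x, y, theta):
    """Gradient of 0.5*(x @ theta - y)**2 with respect to theta."""
    return x * (x @ theta - y)

def predicted_shift(X, y, X_edit, idx, lam):
    """First-order influence prediction of theta_new - theta_old for edited rows idx."""
    n = X.shape[0]
    theta, H = fit_ridge(X, y, lam)
    g = np.zeros_like(theta)
    for i in idx:
        g += per_example_grad(X_edit[i], y[i], theta) - per_example_grad(X[i], y[i], theta)
    return -np.linalg.solve(H, g) / n

def compare(n=500, d=10, n_edit=5, eps=0.05, lam=0.1, seed=0):
    """Return (cosine similarity, relative error) of predicted vs. realized shift."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
    idx = rng.choice(n, size=n_edit, replace=False)
    X_edit = X.copy()
    X_edit[idx] += eps * rng.normal(size=(n_edit, d))   # small edits to a few rows
    theta_old, _ = fit_ridge(X, y, lam)
    theta_new, _ = fit_ridge(X_edit, y, lam)            # exact "retraining"
    realized = theta_new - theta_old
    predicted = predicted_shift(X, y, X_edit, idx, lam)
    cosine = predicted @ realized / (np.linalg.norm(predicted) * np.linalg.norm(realized))
    rel_err = np.linalg.norm(predicted - realized) / np.linalg.norm(realized)
    return cosine, rel_err
```

With a quadratic loss the only source of error is the change in the Hessian caused by the edits, so agreement is expected to be close for small perturbations. For non-convex networks trained with SGD from a fresh initialization there is no such guarantee, which is why this premise is load-bearing.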

What would settle it

A direct experiment retraining models on the Infusion-modified datasets and finding no statistically significant shift in behavior compared to unmodified or randomly edited data would disprove the central effectiveness claim.
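One way such a comparison could be scored is a permutation test on a per-run behavior metric, for example the target-behavior rate measured on each independently retrained model. The sketch below assumes that setup; the function name, the one-sided hypothesis, and the number of permutations are illustrative choices, not something the paper specifies.

```python
import numpy as np

def permutation_test(treated, control, n_perm=10_000, seed=0):
    """One-sided permutation test for mean(treated) > mean(control).

    treated: metric values from runs trained on the edited corpus.
    control: metric values from runs trained on unmodified or randomly edited data.
    Returns (observed mean difference, estimated p-value).
    """
    rng = np.random.default_rng(seed)
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    observed = treated.mean() - control.mean()
    pooled = np.concatenate([treated, control])
    n_t = treated.size
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)                  # shuffle group labels
        diff = perm[:n_t].mean() - perm[n_t:].mean()
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value
```

A large p-value against both the unmodified and the randomly edited controls would count against the effectiveness claim. A small p-value against unmodified data only would leave open whether the influence computation, rather than editing as such, is responsible.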

Original abstract

Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR-10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet $\leftrightarrow$ CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: https://github.com/jrosseruk/infusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Infusion, a framework that uses scalable influence-function approximations to compute small perturbations to a tiny fraction (0.2%, or 100 out of 45,000) of training documents on CIFAR-10, inducing targeted changes in model behavior upon retraining. It reports that these edits are competitive with a baseline that inserts a small number of explicit behavior examples, demonstrates transfer between architectures (ResNet and CNN), and provides preliminary language-model results showing effectiveness mainly at amplifying already-learned behaviors.

Significance. If the influence approximations remain accurate under simultaneous edits and full retraining, the work would establish a practical, low-footprint method for steering model behavior via training-data edits, with direct implications for both adversarial robustness and data-interpretability defenses. The reported cross-architecture transfer and competitive performance at 0.2% edit rate would be notable strengths.

major comments (2)
  1. [§4 (CIFAR-10 experiments)] The central empirical claim (abstract and §4) that Infusion edits to 100 points are competitive with explicit poisoning baselines rests on the assumption that the influence-function-guided perturbations produce the intended parameter shifts after full retraining from scratch. The manuscript provides no direct comparison between the influence-predicted parameter delta and the realized delta after SGD on the edited corpus, leaving open whether observed success stems from the approximation or from the edits being useful irrespective of the influence calculation.
  2. [§3.2] §3.2 (scalable approximation): the description of the influence-function implementation does not quantify or bound the approximation error when 100 simultaneous edits are applied, nor does it report how the chosen Hessian-inverse estimator (e.g., LiSSA, conjugate-gradient, or sampling) behaves far from the original optimum after the full retraining trajectory. This is load-bearing for the claim that the method systematically shapes behavior via influence-guided edits.
minor comments (2)
  1. [Tables/Figures] Table 1 and Figure 2: axis labels and legend entries should explicitly state whether the reported accuracies are on the clean test set or on a poisoned evaluation set.
  2. [§5] The language-model section (§5) is labeled 'preliminary'; the authors should clarify the exact training regime (full fine-tuning vs. LoRA) and the number of runs used to compute the reported probability shifts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The concerns about validating the influence approximations under full retraining are well-taken, and we will strengthen the manuscript with additional analysis and experiments to address them directly.

Point-by-point responses
  1. Referee: [§4 (CIFAR-10 experiments)] The central empirical claim (abstract and §4) that Infusion edits to 100 points are competitive with explicit poisoning baselines rests on the assumption that the influence-function-guided perturbations produce the intended parameter shifts after full retraining from scratch. The manuscript provides no direct comparison between the influence-predicted parameter delta and the realized delta after SGD on the edited corpus, leaving open whether observed success stems from the approximation or from the edits being useful irrespective of the influence calculation.

    Authors: We agree that a direct comparison would provide stronger evidence that the observed effects arise from the influence-guided perturbations rather than incidental properties of the edits. In the revised manuscript we will add this analysis to §4: we will compute the influence-predicted parameter delta for the 100 edited points and contrast it with the realized parameter shift after full SGD retraining from scratch, reporting metrics such as cosine similarity and relative norm difference. This will clarify the contribution of the approximation while preserving the competitive performance results.
    revision: yes

  2. Referee: [§3.2] §3.2 (scalable approximation): the description of the influence-function implementation does not quantify or bound the approximation error when 100 simultaneous edits are applied, nor does it report how the chosen Hessian-inverse estimator (e.g., LiSSA, conjugate-gradient, or sampling) behaves far from the original optimum after the full retraining trajectory. This is load-bearing for the claim that the method systematically shapes behavior via influence-guided edits.

    Authors: We acknowledge that explicit error quantification for the multi-edit case and behavior of the Hessian-inverse estimator away from the original optimum would strengthen the methodological claims. In the revision we will (i) specify that LiSSA is the estimator used, (ii) add an empirical error analysis comparing approximate influence scores against exact leave-one-out retraining on a held-out subset of edits, and (iii) include a short discussion of estimator stability along the retraining trajectory. These additions will be placed in §3.2 and the experimental appendix.
    revision: yes
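For readers unfamiliar with the estimator named in the rebuttal, the core of a LiSSA-style inverse-Hessian-vector product is a truncated Neumann series. The sketch below shows the generic recursion only; it is not the paper's implementation, and `scale`, `damping`, and `steps` are hypothetical parameters chosen for illustration.

```python
import numpy as np

def lissa_inverse_hvp(hvp, v, scale=10.0, damping=0.0, steps=1000):
    """Approximate (H + damping * I)^{-1} v with a truncated Neumann series.

    hvp(u) must return H @ u, or a mini-batch estimate of it in the stochastic case.
    The recursion converges when the eigenvalues of (H + damping * I) / scale
    lie strictly between 0 and 2.
    """
    h = v.copy()
    for _ in range(steps):
        # h_j = v + (I - (H + damping * I) / scale) @ h_{j-1}
        h = v + h - (hvp(h) + damping * h) / scale
    return h / scale

def demo(seed=0):
    """Compare the recursion against a direct solve on a small SPD matrix."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
    H = Q @ np.diag([0.5, 1.0, 2.0, 3.0, 4.0]) @ Q.T   # known spectrum
    v = rng.normal(size=5)
    approx = lissa_inverse_hvp(lambda u: H @ u, v)
    exact = np.linalg.solve(H, v)
    return np.max(np.abs(approx - exact))
```

The recursion assumes a positive-definite (or damped) Hessian at the point where it is evaluated. That is the reason the referee asks how the estimate behaves once retraining has moved the parameters away from the original optimum.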

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of influence-based edits via actual retraining

full rationale

The paper's central claims rest on experimental results: edits computed via influence approximations are applied to 0.2% of the training data, the model is retrained from scratch, and performance is measured on held-out test sets. No derivation chain reduces a prediction to its own inputs by construction, no fitted parameter is relabeled as a prediction, and no load-bearing step depends on self-citation of an unverified uniqueness result. The influence-function approximations are used only to construct the edits; success is validated externally by the retraining experiments themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the validity of first-order influence approximations for finite perturbations and on the assumption that the target behavior can be expressed as a parameter shift reachable by small data edits. No new mathematical axioms are introduced beyond standard influence-function theory.
