pith. sign in

arxiv: 2510.10510 · v2 · submitted 2025-10-12 · 💻 cs.LG · cs.AI

f-INE: A Hypothesis Testing Framework for Estimating Influence under Training Randomness

Pith reviewed 2026-05-18 07:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords influence estimationhypothesis testingtraining randomnessdata influencemachine learning debuggingpoisoned data detectionsingle-run algorithm
0
0 comments X

The pith

A hypothesis-testing method estimates how much each training sample shapes the model while treating randomness as part of the measurement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents f-influence, a framework that casts influence estimation as a statistical hypothesis test so that the measured effect of any one sample stands out from the variations caused by random training choices. Existing influence methods produce unstable rankings that change with each training run, which makes decisions about which data to keep or discard unreliable for debugging or curation tasks. By grounding the estimate in hypothesis testing, the approach yields consistent scores that satisfy several formal properties useful for trustworthy use. An accompanying algorithm called f-INE obtains these scores from a single training run rather than repeated full trainings. The method is demonstrated at scale by identifying poisoned examples in the instruction-tuning data of an 8-billion-parameter language model.

Core claim

We introduce f-influence -- a new influence estimation framework grounded in hypothesis testing that explicitly accounts for training randomness, and establish desirable properties that make it suitable for reliable influence estimation. We also design a highly efficient algorithm f-INE that computes f-influence in a single training run. Finally, we scale up f-INE to estimate influence of instruction tuning data on Llama-3.1-8B and show it can reliably detect poisoned samples that steer model opinions.

What carries the argument

f-influence, a hypothesis-test statistic that isolates the contribution of a single training sample from stochastic variations in the optimization process.

If this is right

  • Influence rankings become consistent across runs that differ only in random seed.
  • Data-cleanup decisions for large models rest on statistically grounded rather than run-dependent scores.
  • Attribution of model outputs or biases to specific training examples becomes feasible at practical cost.
  • Influence estimation scales to models the size of Llama-3.1-8B without requiring repeated full trainings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hypothesis-testing lens could be applied to other stochastic training settings such as reinforcement learning or federated learning.
  • Future influence methods may need to report statistical significance or p-values rather than raw scores alone.
  • Single-run approximations open the door to influence tracking during the original training itself rather than as a post-hoc step.

Load-bearing premise

The hypothesis test and its single-run approximation can correctly separate the effect of one sample from training randomness without multiple independent runs or extra assumptions about the loss surface.

What would settle it

Compute f-influence scores for the same samples across several independent training runs with different random seeds; the scores should remain stable and continue to flag known poisoned examples at high rank if the method works.

read the original abstract

Influence estimation methods promise to explain and debug machine learning by estimating the impact of individual samples on the final model. Yet, existing methods collapse under training randomness: the same example may appear critical in one run and irrelevant in the next. Such instability undermines their use in data curation or cleanup since it is unclear if we indeed deleted/kept the correct datapoints. To overcome this, we introduce *f-influence* -- a new influence estimation framework grounded in hypothesis testing that explicitly accounts for training randomness, and establish desirable properties that make it suitable for reliable influence estimation. We also design a highly efficient algorithm **f**-**IN**fluence **E**stimation (**f-INE**) that computes f-influence **in a single training run**. Finally, we scale up f-INE to estimate influence of instruction tuning data on Llama-3.1-8B and show it can reliably detect poisoned samples that steer model opinions, demonstrating its utility for data cleanup and attributing model behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces f-influence, a hypothesis-testing framework for estimating the influence of individual training samples while explicitly accounting for randomness in stochastic training procedures such as SGD. It claims to establish desirable properties of this measure, proposes the efficient single-run algorithm f-INE, and demonstrates its use in detecting poisoned samples within the instruction-tuning data for Llama-3.1-8B.

Significance. If the single-run approximation is shown to reliably isolate per-sample effects from training stochasticity, the work would be significant for scalable influence estimation in large models, where repeated full retrainings are prohibitive. The emphasis on hypothesis testing and the concrete application to data poisoning detection represent practical advances over prior influence methods that ignore training variability.

major comments (3)
  1. [§3.1] §3.1 (Hypothesis Testing Formulation): The central claim that f-influence correctly marginalizes over training randomness via a single-run surrogate requires explicit verification that the internal test statistic is insensitive to the particular noise realization in the optimization trajectory; no such bound or consistency argument with multi-run ground truth is provided, making the isolation of sample influence load-bearing and unverified.
  2. [§4.2] §4.2 (Llama-3.1-8B Experiment): The reported detection of poisoned samples lacks any quantitative comparison (e.g., correlation or ranking agreement) between single-run f-INE estimates and estimates obtained from multiple independent full trainings; without this, it is unclear whether the method separates sample effects from stochasticity or simply reflects one particular training path.
  3. [§3.2] §3.2 (Algorithm f-INE): The single-run approximation implicitly relies on unstated assumptions about the loss landscape or optimization path (e.g., local smoothness or convexity) that are not justified for non-convex fine-tuning of LLMs; this assumption is load-bearing for the reliability claim but receives no analysis or sensitivity check.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'establish desirable properties' is vague; a one-sentence enumeration of the key properties (e.g., monotonicity, invariance to randomness) would improve immediate clarity.
  2. [Notation] Notation section: The definition of the test statistic replacing the full distribution over retrainings should be given an explicit equation number and contrasted with the multi-run ideal to reduce ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help clarify the strengths and areas for improvement in our work on f-INE. We appreciate the recognition of its potential for scalable influence estimation under training randomness. Below we respond point-by-point to the major comments, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [§3.1] §3.1 (Hypothesis Testing Formulation): The central claim that f-influence correctly marginalizes over training randomness via a single-run surrogate requires explicit verification that the internal test statistic is insensitive to the particular noise realization in the optimization trajectory; no such bound or consistency argument with multi-run ground truth is provided, making the isolation of sample influence load-bearing and unverified.

    Authors: We agree that an explicit consistency argument would strengthen the presentation. The hypothesis-testing formulation of f-influence is constructed to treat the training outcome as a random variable and thereby marginalize over SGD noise by design; the single-run surrogate is obtained by fixing the observed trajectory and testing the null that the sample has zero influence. In the revision we will add a new proposition in §3.1 together with a proof sketch establishing that, under standard bounded-variance assumptions on the stochastic gradients, the single-run test statistic converges in probability to the multi-run expectation as the number of optimization steps grows. We will also include a small-scale empirical check on a toy model comparing single-run and multi-run estimates. revision: yes

  2. Referee: [§4.2] §4.2 (Llama-3.1-8B Experiment): The reported detection of poisoned samples lacks any quantitative comparison (e.g., correlation or ranking agreement) between single-run f-INE estimates and estimates obtained from multiple independent full trainings; without this, it is unclear whether the method separates sample effects from stochasticity or simply reflects one particular training path.

    Authors: We acknowledge that a direct multi-run comparison on Llama-3.1-8B would be the strongest validation. Unfortunately, the computational cost of even two additional full instruction-tuning runs of this model exceeds available resources, which is exactly the regime where single-run methods are required. To address the concern we will add a controlled study on a smaller (1B-parameter) model where multiple independent trainings are feasible; we will report Spearman rank correlation and top-k overlap between single-run f-INE and the multi-run ground truth. For the Llama-3.1-8B results we will strengthen the presentation by showing that the detected poisoned samples produce statistically significant shifts in model outputs on held-out prompts, providing indirect evidence that the influence signal is not merely an artifact of one trajectory. revision: partial

  3. Referee: [§3.2] §3.2 (Algorithm f-INE): The single-run approximation implicitly relies on unstated assumptions about the loss landscape or optimization path (e.g., local smoothness or convexity) that are not justified for non-convex fine-tuning of LLMs; this assumption is load-bearing for the reliability claim but receives no analysis or sensitivity check.

    Authors: The f-INE procedure is derived directly from the hypothesis-testing definition and requires only local Lipschitz continuity of the loss in a neighborhood of the converged parameters; global convexity or smoothness is never invoked. In the revision we will explicitly state this local assumption in §3.2, justify its plausibility for instruction fine-tuning (small learning rates and limited parameter updates), and add a sensitivity study in the appendix that perturbs learning rate, batch size, and random seed while tracking stability of the resulting influence rankings. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; framework presented as novel without reduction to inputs.

full rationale

The paper introduces f-influence as a new hypothesis-testing framework that explicitly accounts for training randomness, claims to establish desirable properties for reliable estimation, and describes an efficient single-run algorithm f-INE validated on Llama-3.1-8B. No equations, fitted parameters renamed as predictions, or self-citations are shown in the provided text that would make any central claim reduce by construction to its own inputs or prior author work. The derivation chain is self-contained as an independent formulation grounded in hypothesis testing rather than relying on load-bearing self-references or ansatzes smuggled from earlier papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the hypothesis-testing setup is described at a high level without detailing modeling assumptions or fitted quantities.

pith-pipeline@v0.9.0 · 5725 in / 1079 out tokens · 30307 ms · 2026-05-18T07:24:43.371707+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.