arxiv: 2601.21929 · v2 · pith:6M3XI6D7new · submitted 2026-01-29 · 💻 cs.LG

LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution

Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann

Pith reviewed 2026-05-16 10:06 UTC · model grok-4.3

classification 💻 cs.LG

keywords training data attribution · influence functions · low-rank approximation · scalable machine learning · gradient projection · Hessian approximation · Woodbury identity

The pith

LoRIF stores low-rank factors of projected gradients and approximates the Hessian inverse in a reduced subspace to scale influence functions for training data attribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LoRIF to overcome storage and I/O bottlenecks that arise when applying influence functions to large training sets. It replaces full projected gradient vectors with rank-c factors, cutting per-sample storage and loading cost, and replaces the full inverse Hessian with a Woodbury-based approximation confined to an r-dimensional subspace obtained via truncated SVD. On models ranging from 0.1B to 70B parameters and datasets with millions of examples, these changes deliver up to 20x reductions in storage and query latency while matching or exceeding the attribution quality of prior projection-based methods such as LoGRA.

Core claim

LoRIF exploits low-rank structure in projected per-example gradients to store compact rank-c factors rather than full vectors, reducing storage and I/O from O(D) to O(c sqrt(D)) per layer per sample, and approximates the inverse Hessian term via truncated SVD plus the Woodbury identity inside an r-dimensional subspace, cutting memory from O(D^2) to O(Dr).
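
The storage claim can be illustrated with a minimal sketch, assuming the projected gradient of one layer is viewed as a d1 × d2 matrix with D = d1·d2. The sizes, the synthetic data, and the helper names `factorize_rank_c` and `factored_dot` are hypothetical, and truncated SVD is used only as a simple stand-in; it is not claimed to be the authors' solver.

```python
import numpy as np

def factorize_rank_c(grad, c):
    """Best rank-c factors (A, B) of a d1 x d2 matrix, so grad ≈ A @ B.T."""
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    A = U[:, :c] * s[:c]      # (d1, c): singular values folded into the left factor
    B = Vt[:c].T              # (d2, c)
    return A, B

def factored_dot(A1, B1, A2, B2):
    """Frobenius inner product <A1 B1^T, A2 B2^T> without rebuilding d1 x d2 matrices."""
    return float(np.sum((A1.T @ A2) * (B1.T @ B2)))

rng = np.random.default_rng(0)
d1, d2, c = 64, 64, 4         # hypothetical sizes; D = d1 * d2 = 4096

def synthetic_gradient():
    # Rank-2 signal plus small noise, standing in for a projected per-example gradient.
    signal = rng.standard_normal((d1, 2)) @ rng.standard_normal((2, d2))
    return signal + 0.01 * rng.standard_normal((d1, d2))

G1, G2 = synthetic_gradient(), synthetic_gradient()
(A1, B1), (A2, B2) = factorize_rank_c(G1, c), factorize_rank_c(G2, c)

full_floats = d1 * d2               # O(D) numbers per layer per sample
factored_floats = c * (d1 + d2)     # O(c * sqrt(D)) when d1 and d2 are both about sqrt(D)
rel_err = np.linalg.norm(G1 - A1 @ B1.T) / np.linalg.norm(G1)
print(full_floats, factored_floats, rel_err)
print(factored_dot(A1, B1, A2, B2))
```

With these toy sizes the factored form keeps 512 numbers instead of 4096, and inner products between two stored gradients can be taken directly on the factors. How small the reconstruction error is on real gradients depends on how low-rank they actually are, which is the paper's empirical premise.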

What carries the argument

Low-rank factorization of projected gradients combined with truncated-SVD plus Woodbury identity for the reduced-dimensional inverse Hessian approximation.
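
A short derivation may make this step concrete. It is a sketch under stated assumptions: $G$ is the per-layer matrix of projected gradients, $\lambda > 0$ is a damping term, $G \approx U_r \Sigma_r V_r^\top$ is a rank-$r$ truncated SVD, and $g' = V_r^\top g$ denotes a gradient projected into the $r$-dimensional subspace. The notation follows the expression excerpted alongside Figure 3; the intermediate steps are standard linear algebra rather than text quoted from the paper.

```latex
\begin{align*}
G^\top G + \lambda I
  &\approx \lambda I + V_r \Sigma_r^{2} V_r^\top, \\
\left(\lambda I + V_r \Sigma_r^{2} V_r^\top\right)^{-1}
  &= \frac{1}{\lambda} I
   - \frac{1}{\lambda^{2}}\, V_r \left(\Sigma_r^{-2} + \frac{1}{\lambda} V_r^\top V_r\right)^{-1} V_r^\top
   && \text{(Woodbury)} \\
  &= \frac{1}{\lambda} I
   - \frac{1}{\lambda^{2}}\, V_r \left(\Sigma_r^{-2} + \frac{1}{\lambda} I_r\right)^{-1} V_r^\top
   && (V_r^\top V_r = I_r), \\
g_{te}^\top \left(\lambda I + V_r \Sigma_r^{2} V_r^\top\right)^{-1} g_{tr}
  &= \frac{1}{\lambda}\, g_{te}^\top g_{tr}
   - \frac{1}{\lambda^{2}}\, (g'_{te})^\top \left(\Sigma_r^{-2} + \frac{1}{\lambda} I_r\right)^{-1} g'_{tr}.
\end{align*}
```

The inner matrix is $r \times r$ and diagonal, so only $V_r$ (size $D \times r$) and the $r$ singular values need to be kept, which is where the $O(Dr)$ memory figure comes from.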

If this is right

Enables influence-based attribution queries on models with billions of parameters and millions of training examples without prohibitive storage.
Allows practitioners to increase projection dimension D for higher attribution quality while keeping total storage and latency manageable.
Reduces the I/O bottleneck that previously dominated query latency when loading per-example gradients from disk.
Makes gradient-based training data attribution competitive in runtime with simpler heuristic methods at frontier scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same low-rank storage pattern could be applied to other gradient-dependent interpretability tools beyond influence functions.
If the observed low-rank structure persists across additional model families, LoRIF-style approximations may become a default efficiency layer for any large-scale gradient attribution pipeline.
The method opens a route to real-time or iterative data attribution during training rather than only post-hoc.

Load-bearing premise

The low-rank structure present in the projected gradients survives rank-c truncation and r-dimensional Hessian approximation well enough that the resulting attribution scores remain faithful to the full influence function.

What would settle it

On a model where both LoRIF and the full influence function are computed, the sets of highest-scoring training examples differ enough that removing the top examples identified by each method produces measurably different changes in test performance.
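
A first, cheap step toward such a test is to measure how much the two top-k sets overlap before running any removal experiment. The sketch below is illustrative only: `top_k_overlap` and the toy score vectors are hypothetical, and a low overlap would indicate disagreement without by itself establishing which method is more faithful.

```python
import numpy as np

def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the k highest-scoring indices shared by two score vectors."""
    top_a = set(np.argsort(scores_a)[-k:].tolist())
    top_b = set(np.argsort(scores_b)[-k:].tolist())
    return len(top_a & top_b) / k

# Toy attribution scores for six training examples (made-up numbers).
exact = np.array([0.90, 0.10, 0.80, 0.30, 0.70, 0.20])    # e.g. full influence function
approx = np.array([0.85, 0.15, 0.20, 0.30, 0.75, 0.80])   # e.g. a low-rank approximation

overlap = top_k_overlap(exact, approx, k=3)
print(overlap)
```

Here the exact top-3 set is {0, 2, 4} and the approximate one is {0, 4, 5}, so two of three examples are shared.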

Figures

Figures reproduced from arXiv: 2601.21929 by Hieu Le, Jingyi Xu, Mathieu Salzmann, Shuangqi Li.

**Figure 1.** Attribution quality (LDS) vs. effective projection dimension $D$. We compare LoGRA (no factorization) against rank-$c$ factorization for $c \in \{1, 4, 16\}$ on GPT2-small, varying $D$ by setting $d_1 = I/f$ and $d_2 = O/f$ for $f \in \{64, 32, 16, 8\}$ (so $D = I \cdot O / f^2$). 3. Method: LoRIF. LoRIF follows the standard TRAK/LoGRA pipeline but addresses the two scalability bottlenecks: (1) gradient storage and query-time I/O, and (2) …

**Figure 3.** Time spent in loading gradients and GPU computation. We visualize the query-time latency breakdown for different methods, with the same effective projection dimension $D = I \cdot O / 8^2$ and a truncation rank $r = D/16$ for each layer of GPT2-small. "Ours" means combining rank-1 factorization and the truncated SVD approximation. … becomes $\mathcal{I}(x_{tr}, x_{te}) \approx \frac{1}{\lambda}\, g_{te}^\top g_{tr} - \frac{1}{\lambda^2}\, (g'_{te})^\top \left(\Sigma_r^{-2} + \frac{1}{\lambda} I_r\right)^{-1} g'_{tr}$. (9) …

**Figure 4.** LoRIF (Ours) vs. LoGRA on attribution quality (LDS), on GPT2-small. We vary the storage cost by choosing different effective projection dimensions $D$, where $D = I \cdot O / f^2$, and factorization ranks $c$ (only for LoRIF). For LoGRA, we choose $f \in \{64, 32, 16, 8\}$. For LoRIF, we use $f \in \{32, 16, 8, 4\}$ when $c = 1$, and $c \in \{1, 4, 16, 32\}$ when $f = 4$. … achieves higher attribution quality than LoGRA when pushing the stora…

**Figure 6.** LDS vs. tail-patch score on GPT2-small. Each point is a method–configuration pair. The dashed line is a linear fit. RepSim (non-gradient-based) deviates most from the trend. … is important when evaluating large models. • Adapted hyperparameters: Chang et al. (2024) recommend using the original training hyperparameters, but this is infeasible for models trained on thousands of GPUs with very large batch sizes…

**Figure 7.** Spectral concentration of the gradient matrix $G$: cumulative explained variance ratio $\mathrm{EVR}(r)$ from Equation (10). The spectrum shows moderate concentration. C. Spectral Concentration and Choosing the Truncation Rank $r$. Spectrum concentration. Let $\sigma_1 \ge \sigma_2 \ge \cdots$ denote the singular values of the per-layer gradient matrix $G$. We measure spectral concentration using the cumulative explained-variance ratio $\mathrm{EVR}(r)$…

**Figure 8.** LDS vs. truncation rank $r$ with rank-$c$ factorization. Different curves correspond to different projection dimensions $D$ and factorization ranks $c$. LDS saturates at $r \ll D$, especially for small $c$, confirming that truncated SVD remains effective when combined with low-rank gradient storage. The experiment is run on GPT2-small. This weight is monotone in $\sigma_i^2$ and strongly suppresses directions with small singular v…

**Figure 9.** Qualitative comparison on Olmo3-7B (Example 1).

**Figure 10.** Qualitative comparison on Olmo3-7B (Example 2).

**Figure 11.** Qualitative comparison on Olmo3-7B (Example 3).

**Figure 12.** Qualitative comparison on Apertus-70B (Example 1).

**Figure 13.** Qualitative comparison on Apertus-70B (Example 2).

**Figure 14.** Qualitative comparison on Apertus-70B (Example 3).

Original abstract

Training data attribution (TDA) identifies which training examples most influenced a model's prediction. Influence function methods are a theoretically grounded family of TDA methods and exploit gradients. To overcome the scalability challenge arising from gradient computation, the most popular strategy is random projection (e.g., TRAK, LoGRA). However, this still faces two bottlenecks when scaling to large training sets and high-quality attribution: \emph{(i)} storing and loading projected per-example gradients for all $N$ training examples, where query latency is dominated by I/O; and \emph{(ii)} forming the $D \times D$ inverse Hessian approximation, which costs $O(D^2)$ memory. Both bottlenecks scale with the projection dimension $D$, yet increasing $D$ is necessary for attribution quality -- creating a quality--scalability tradeoff. We introduce \textbf{LoRIF} (\textbf{Lo}w-\textbf{R}ank \textbf{I}nfluence \textbf{F}unctions), which exploits low-rank structures of gradient to address both bottlenecks. First, we store rank-$c$ factors of projected per-example gradients rather than full matrices, reducing storage and query-time I/O from $O(D)$ to $O(c\sqrt{D})$ per layer per sample. Second, we use truncated SVD with the Woodbury identity to approximate the inverse Hessian term in an $r$-dimensional subspace, reducing memory from $O(D^2)$ to $O(Dr)$. On models from 0.1B to 70B parameters trained on datasets with millions of examples, LoRIF achieves up to 20$\times$ storage reduction and query-time speedup compared to LoGRA, while matching or exceeding its attribution quality. LoRIF makes gradient-based TDA practical at frontier scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LoRIF delivers practical 20x storage and query speedups for influence functions on 0.1B-70B models by storing rank-c gradient factors and using an r-dimensional Woodbury Hessian approximation, with quality parity to LoGRA on the reported experiments.

The core advance is storing projected per-example gradients as rank-c factors instead of full vectors and approximating the inverse Hessian via truncated SVD plus the Woodbury identity in an r-dimensional subspace. This directly attacks the I/O bottleneck for loading N gradients and the O(D^2) memory for the Hessian term that still limit TRAK and LoGRA when D must stay large for good attribution. On the 0.1B to 70B models with million-example datasets, the paper shows up to 20x lower storage and faster queries while matching or beating LoGRA quality, which is the right scale to test if the goal is frontier applicability. The engineering is straightforward and uses standard linear algebra, so the gains look reproducible if the low-rank structure holds in the tested layers. The main soft spot is the lack of any explicit error bound linking c and r to the projection dimension D or the Hessian spectrum; the quality claims rest on the specific (c,r) pairs chosen for those runs. If the gradient low-rank property weakens on other architectures or data distributions, the speedups could come with unnoticed attribution drift. No circularity or fitting issues appear in the setup. This is useful for anyone doing training-data auditing or data selection on large models, where current methods are still too slow to run routinely. It deserves peer review because the empirical claims are concrete and the method is simple enough to implement and stress-test independently.

Referee Report

2 major / 2 minor

Summary. The paper introduces LoRIF, a low-rank variant of influence functions for training data attribution. It stores rank-c factorizations of projected per-example gradients (reducing per-sample storage and I/O from O(D) to O(c√D)) and approximates the inverse-Hessian term via truncated SVD plus the Woodbury identity in an r-dimensional subspace (reducing memory from O(D²) to O(Dr)). On models ranging from 0.1B to 70B parameters, LoRIF reports up to 20× gains in storage and query latency versus LoGRA while matching or exceeding attribution quality.

Significance. If the low-rank approximations preserve influence-score fidelity, the method would meaningfully advance the practicality of gradient-based TDA at frontier scale by directly attacking the two dominant bottlenecks (gradient storage and I/O, and Hessian memory) that have limited prior projection-based approaches such as TRAK and LoGRA. The empirical demonstration across three orders of magnitude in model size is a concrete strength.

major comments (2)

[Abstract and §3] The central claim that rank-c truncation plus r-dimensional Hessian approximation yields attribution scores whose ranking and magnitude remain faithful to the full influence function is load-bearing, yet no error bound or sensitivity analysis is supplied that relates the truncation parameters (c,r) to projection dimension D or to the spectrum of the Hessian; empirical parity is shown only for the specific (c,r) pairs chosen in the 0.1B–70B experiments.
[§4] (Experimental setup) It is unclear whether the reported quality parity holds after controlling for post-hoc tuning of c and r on the evaluation sets or whether the low-rank structure of projected gradients degrades in particular layers or data regimes; without such controls the 20× speedup claim risks being tied to favorable hyper-parameter choices rather than a general property.

minor comments (2)

Notation for the projection dimension D, rank c, and subspace dimension r should be introduced once in a dedicated paragraph and used consistently thereafter.
[Tables and Figures] Figure captions and Table 1 should explicitly state the exact (c,r) values used for each model size so readers can reproduce the storage and latency numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential impact of LoRIF on scaling gradient-based training data attribution. We address each major comment below with clarifications and proposed revisions.

Referee: [Abstract and §3] the central claim that rank-c truncation plus r-dimensional Hessian approximation yields attribution scores whose ranking and magnitude remain faithful to the full influence function is load-bearing, yet no error bound or sensitivity analysis is supplied that relates the truncation parameters (c,r) to projection dimension D or to the spectrum of the Hessian; empirical parity is shown only for the specific (c,r) pairs chosen in the 0.1B–70B experiments.

Authors: We agree that a formal error bound relating approximation error to c, r, D, and the Hessian spectrum would provide stronger theoretical grounding. Deriving such bounds is non-trivial given the dependence on the empirical Hessian spectrum and data distribution. Our defense rests on the extensive empirical results across three orders of magnitude in model size (0.1B–70B), where the selected (c,r) pairs consistently match or exceed LoGRA quality. In revision we will add a sensitivity study (new subsection in §3) plotting attribution metrics (e.g., Kendall-τ and top-k overlap) versus c and r on a representative 1B model, together with a brief discussion of the observed stability regimes.
Revision: partial

Referee: [§4] it is unclear whether the reported quality parity holds after controlling for post-hoc tuning of c and r on the evaluation sets or whether the low-rank structure of projected gradients degrades in particular layers or data regimes; without such controls the 20× speedup claim risks being tied to favorable hyper-parameter choices rather than a general property.

Authors: The hyperparameters c and r were fixed prior to final evaluation using only a small held-out validation split that does not overlap with any reported test sets; no post-hoc tuning on evaluation data occurred. To further address concerns about layer- or regime-specific degradation, the revised manuscript will include per-layer attribution quality breakdowns and results on stratified data subsets (e.g., by difficulty or domain). These additions will confirm that the low-rank factors preserve fidelity uniformly, supporting that the reported speedups are not artifacts of particular hyper-parameter selections.
Revision: yes
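
The kind of sensitivity sweep promised in the first response can be sketched on synthetic data. Everything here is an assumption made for illustration: the gradients are random matrices with no real low-rank structure, the Hessian term is dropped (scores are plain inner products), only the training gradients are truncated, and `kendall_tau` is a simple tau-a written out by hand rather than a library routine.

```python
import numpy as np

def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties add zero."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    i, j = np.triu_indices(len(a), k=1)          # all index pairs with i < j
    agreement = np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return float(agreement.sum() / len(i))

def truncate(matrix, c):
    """Best rank-c approximation of a matrix via SVD."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return (U[:, :c] * s[:c]) @ Vt[:c]

rng = np.random.default_rng(0)
n_train, d1, d2 = 40, 8, 8                       # hypothetical toy sizes
train = rng.standard_normal((n_train, d1, d2))   # per-example "gradients"
query = rng.standard_normal((d1, d2))            # test-example "gradient"
full_scores = np.einsum("nij,ij->n", train, query)

taus = {}
for c in (1, 2, 4, 8):
    approx = np.stack([truncate(m, c) for m in train])
    scores_c = np.einsum("nij,ij->n", approx, query)
    taus[c] = kendall_tau(full_scores, scores_c)
print(taus)
```

At c = 8 the matrices are reproduced at full rank, so the ranking agreement is perfect. The values at smaller c on this unstructured toy data say nothing about real models; the point is only the shape of the experiment.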

Circularity Check

0 steps flagged

No circularity: approximations use external matrix identities

The paper's derivation applies the Woodbury matrix identity and truncated SVD to projected gradients and the Hessian inverse. These are standard, externally verifiable linear-algebra facts independent of the paper's data or fitted parameters. Storage and query reductions follow directly from the rank-c and r-dimensional factorizations without redefining any quantity in terms of its own output. No self-citation chain, ansatz smuggling, or fitted-input-as-prediction pattern appears in the derivation steps. The empirical quality claims are presented as validation rather than definitional tautologies.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach introduces two tunable ranks (c for gradient factors, r for the Hessian subspace) whose values must be chosen for each model and dataset; it also relies on the standard Woodbury identity and the assumption that gradients admit useful low-rank structure.

free parameters (2)

rank c
Compression rank for per-example gradient factors; chosen per layer or model to balance quality and storage.
subspace dimension r
Dimension of the truncated SVD subspace used for the inverse-Hessian approximation.
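
One plausible way to pick r, suggested by the cumulative explained-variance ratio mentioned in the Figure 7 caption, is to take the smallest rank that reaches a target ratio. This is a sketch, not the paper's procedure: Equation (10) is not reproduced on this page, so the usual definition based on squared singular values is assumed, and the function names and the 0.9 target are hypothetical.

```python
import numpy as np

def explained_variance_ratio(singular_values):
    """EVR(r) for r = 1..n, assuming EVR(r) = sum_{i<=r} sigma_i^2 / sum_i sigma_i^2."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    return np.cumsum(energy) / energy.sum()

def smallest_rank_for(singular_values, target):
    """Smallest r whose cumulative ratio reaches `target` (capped at the full rank)."""
    evr = explained_variance_ratio(singular_values)
    r = int(np.searchsorted(evr, target)) + 1    # first index with evr >= target, 1-based
    return min(r, len(evr))

sigma = np.array([4.0, 2.0, 2.0, 1.0])   # toy spectrum; squared values 16, 4, 4, 1
evr = explained_variance_ratio(sigma)    # cumulative sums 16, 20, 24, 25 over 25
r = smallest_rank_for(sigma, target=0.9)
print(evr, r)
```

For this toy spectrum the ratio first reaches 0.9 at r = 3 (24/25 = 0.96), while r = 2 gives only 0.80.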

axioms (1)

standard math Woodbury matrix identity holds for the low-rank update form of the Hessian approximation
Invoked to compute the inverse in the r-dimensional subspace without forming the full D x D matrix.
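
The identity can be checked numerically on a small problem. The sketch below assumes the damped form (G^T G + λI)^{-1}, builds a toy G of exact rank r so that truncation loses nothing, and compares the full D × D route with the reduced route that keeps only V_r and the top r singular values. All sizes, the damping value, and the `score` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, r, lam = 50, 20, 5, 0.1     # hypothetical sizes and damping

# Gradient matrix G (N x D) of exact rank r, so the rank-r SVD is lossless.
G = rng.standard_normal((N, r)) @ rng.standard_normal((r, D))

# Reference route: form the full D x D damped inverse (O(D^2) memory).
H_inv = np.linalg.inv(G.T @ G + lam * np.eye(D))

# Reduced route: keep only V_r (D x r) and the top-r singular values (O(D r) memory).
_, s, Vt = np.linalg.svd(G, full_matrices=False)
V_r, s_r = Vt[:r].T, s[:r]
core = 1.0 / (s_r ** -2 + 1.0 / lam)   # diagonal of (Sigma_r^{-2} + I_r / lam)^{-1}

def score(g_te, g_tr):
    """g_te^T (G^T G + lam I)^{-1} g_tr computed through the r-dimensional subspace."""
    p_te, p_tr = V_r.T @ g_te, V_r.T @ g_tr      # projected gradients g'
    return g_te @ g_tr / lam - (p_te * core) @ p_tr / lam ** 2

g_te, g_tr = rng.standard_normal(D), rng.standard_normal(D)
direct = g_te @ H_inv @ g_tr
reduced = score(g_te, g_tr)
print(direct, reduced)
```

The two numbers agree to floating-point precision here because G has exact rank r. On real gradient matrices the truncation discards the tail of the spectrum, so the match becomes approximate.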

pith-pipeline@v0.9.0 · 5635 in / 1323 out tokens · 46981 ms · 2026-05-16T10:06:41.215540+00:00 · methodology
