LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution
Pith reviewed 2026-05-16 10:06 UTC · model grok-4.3
The pith
LoRIF stores low-rank factors of projected gradients and approximates the Hessian inverse in a reduced subspace to scale influence functions for training data attribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LoRIF exploits low-rank structure in projected per-example gradients to store compact rank-c factors rather than full vectors, reducing storage and I/O from O(D) to O(c sqrt(D)) per layer per sample, and approximates the inverse Hessian term via truncated SVD plus the Woodbury identity inside an r-dimensional subspace, cutting memory from O(D^2) to O(Dr).
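A minimal NumPy sketch of the storage idea in the core claim, assuming each layer's projected gradient can be viewed as a d × d matrix with D = d². The function name `rank_c_factors`, the toy sizes, and the use of block power iteration with QR re-orthonormalisation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rank_c_factors(grad, c, n_iter=10, seed=0):
    """Approximate grad (d1 x d2) by A @ B.T with A: d1 x c and B: d2 x c.

    Hypothetical helper: block power (subspace) iteration, assumed here as
    one plausible way to obtain rank-c factors.
    """
    rng = np.random.default_rng(seed)
    d1, d2 = grad.shape
    # Random orthonormal starting basis for the right subspace.
    Q, _ = np.linalg.qr(rng.standard_normal((d2, c)))
    for _ in range(n_iter):
        P, _ = np.linalg.qr(grad @ Q)     # refine left subspace, d1 x c
        Q, _ = np.linalg.qr(grad.T @ P)   # refine right subspace, d2 x c
    A = grad @ Q   # d1 x c, carries the scale of the gradient
    B = Q          # d2 x c, orthonormal columns
    return A, B

# Toy projected gradient: rank 2 plus small noise, with D = 64 * 64 = 4096.
rng = np.random.default_rng(1)
d, c = 64, 2
grad = rng.standard_normal((d, c)) @ rng.standard_normal((c, d))
grad += 1e-3 * rng.standard_normal((d, d))

A, B = rank_c_factors(grad, c)
rel_err = np.linalg.norm(grad - A @ B.T) / np.linalg.norm(grad)
full_floats = grad.size            # D numbers per layer per sample
factor_floats = A.size + B.size    # c * (d1 + d2) = 2 * c * sqrt(D) numbers
print(f"relative error {rel_err:.2e}; storage {full_floats} -> {factor_floats} floats")
```

In this toy setting the factors hold 2·c·√D numbers instead of D. How well that carries over depends on how close real projected gradients are to rank c, which is exactly the premise the review flags as load-bearing.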
What carries the argument
Low-rank factorization of projected gradients combined with truncated-SVD plus Woodbury identity for the reduced-dimensional inverse Hessian approximation.
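The following sketch illustrates how a truncated SVD plus the Woodbury identity can apply a damped inverse without ever forming a D × D matrix. It assumes the Hessian approximation has the form GᵀG + λI (the LoGRA form quoted further down the page) and that G is replaced by its top-r singular pairs; `apply_inverse_low_rank` and all sizes are hypothetical.

```python
import numpy as np

def apply_inverse_low_rank(V_r, s_r, lam, x):
    """Apply (V_r diag(s_r**2) V_r.T + lam * I)^{-1} to x.

    Only V_r (D x r) and s_r (r,) are stored, so memory is O(D r).
    Assumes V_r has orthonormal columns.
    """
    coeff = s_r**2 / (s_r**2 + lam)              # shape (r,)
    return (x - V_r @ (coeff * (V_r.T @ x))) / lam

rng = np.random.default_rng(0)
N, D, r, lam = 200, 50, 8, 0.1
# Toy gradient matrix of exact rank r, so truncation loses nothing here.
G = rng.standard_normal((N, r)) @ rng.standard_normal((r, D))
_, s, Vt = np.linalg.svd(G, full_matrices=False)
V_r, s_r = Vt[:r].T, s[:r]

x = rng.standard_normal(D)
fast = apply_inverse_low_rank(V_r, s_r, lam, x)
# Dense reference, feasible only because D is tiny in this example.
dense = np.linalg.solve(G.T @ G + lam * np.eye(D), x)
max_abs_diff = float(np.max(np.abs(fast - dense)))
print(f"max abs difference vs dense solve: {max_abs_diff:.2e}")
```

Because the toy G has exact rank r, the two results agree up to floating-point error. With real gradients the discarded singular directions introduce an approximation error that this sketch does not quantify.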
If this is right
- Enables influence-based attribution queries on models with billions of parameters and millions of training examples without prohibitive storage.
- Allows practitioners to increase projection dimension D for higher attribution quality while keeping total storage and latency manageable.
- Reduces the I/O bottleneck that previously dominated query latency when loading per-example gradients from disk.
- Makes gradient-based training data attribution competitive in runtime with simpler heuristic methods at frontier scale.
Where Pith is reading between the lines
- The same low-rank storage pattern could be applied to other gradient-dependent interpretability tools beyond influence functions.
- If the observed low-rank structure persists across additional model families, LoRIF-style approximations may become a default efficiency layer for any large-scale gradient attribution pipeline.
- The method opens a route to real-time or iterative data attribution during training rather than only post-hoc.
Load-bearing premise
The low-rank structure present in the projected gradients survives rank-c truncation and r-dimensional Hessian approximation well enough that the resulting attribution scores remain faithful to the full influence function.
What would settle it
On a model where both LoRIF and the full influence function are computed, the sets of highest-scoring training examples differ enough that removing the top examples identified by each method produces measurably different changes in test performance.
Original abstract
Training data attribution (TDA) identifies which training examples most influenced a model's prediction. Influence function methods are a theoretically grounded family of TDA methods and exploit gradients. To overcome the scalability challenge arising from gradient computation, the most popular strategy is random projection (e.g., TRAK, LoGRA). However, this still faces two bottlenecks when scaling to large training sets and high-quality attribution: (i) storing and loading projected per-example gradients for all $N$ training examples, where query latency is dominated by I/O; and (ii) forming the $D \times D$ inverse Hessian approximation, which costs $O(D^2)$ memory. Both bottlenecks scale with the projection dimension $D$, yet increasing $D$ is necessary for attribution quality, creating a quality-scalability tradeoff. We introduce LoRIF (Low-Rank Influence Functions), which exploits low-rank structures of gradient to address both bottlenecks. First, we store rank-$c$ factors of projected per-example gradients rather than full matrices, reducing storage and query-time I/O from $O(D)$ to $O(c\sqrt{D})$ per layer per sample. Second, we use truncated SVD with the Woodbury identity to approximate the inverse Hessian term in an $r$-dimensional subspace, reducing memory from $O(D^2)$ to $O(Dr)$. On models from 0.1B to 70B parameters trained on datasets with millions of examples, LoRIF achieves up to $20\times$ storage reduction and query-time speedup compared to LoGRA, while matching or exceeding its attribution quality. LoRIF makes gradient-based TDA practical at frontier scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LoRIF, a low-rank variant of influence functions for training data attribution. It stores rank-c factorizations of projected per-example gradients (reducing per-sample storage and I/O from O(D) to O(c√D)) and approximates the inverse-Hessian term via truncated SVD plus the Woodbury identity in an r-dimensional subspace (reducing memory from O(D²) to O(Dr)). On models ranging from 0.1B to 70B parameters, LoRIF reports up to 20× gains in storage and query latency versus LoGRA while matching or exceeding attribution quality.
Significance. If the low-rank approximations preserve influence-score fidelity, the method would meaningfully advance the practicality of gradient-based TDA at frontier scale by directly attacking the two dominant bottlenecks (gradient storage/I-O and Hessian memory) that have limited prior projection-based approaches such as TRAK and LoGRA. The empirical demonstration across three orders of magnitude in model size is a concrete strength.
major comments (2)
- [Abstract and §3] The central claim that rank-c truncation plus r-dimensional Hessian approximation yields attribution scores whose ranking and magnitude remain faithful to the full influence function is load-bearing, yet no error bound or sensitivity analysis is supplied that relates the truncation parameters (c,r) to projection dimension D or to the spectrum of the Hessian; empirical parity is shown only for the specific (c,r) pairs chosen in the 0.1B–70B experiments.
- [§4, experimental setup] It is unclear whether the reported quality parity holds after controlling for post-hoc tuning of c and r on the evaluation sets or whether the low-rank structure of projected gradients degrades in particular layers or data regimes; without such controls the 20× speedup claim risks being tied to favorable hyper-parameter choices rather than a general property.
minor comments (2)
- Notation for the projection dimension D, rank c, and subspace dimension r should be introduced once in a dedicated paragraph and used consistently thereafter.
- [Tables and Figures] Figure captions and Table 1 should explicitly state the exact (c,r) values used for each model size so readers can reproduce the storage and latency numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential impact of LoRIF on scaling gradient-based training data attribution. We address each major comment below with clarifications and proposed revisions.
- Referee: [Abstract and §3] the central claim that rank-c truncation plus r-dimensional Hessian approximation yields attribution scores whose ranking and magnitude remain faithful to the full influence function is load-bearing, yet no error bound or sensitivity analysis is supplied that relates the truncation parameters (c,r) to projection dimension D or to the spectrum of the Hessian; empirical parity is shown only for the specific (c,r) pairs chosen in the 0.1B–70B experiments.
  Authors: We agree that a formal error bound relating approximation error to c, r, D, and the Hessian spectrum would provide stronger theoretical grounding. Deriving such bounds is non-trivial given the dependence on the empirical Hessian spectrum and data distribution. Our defense rests on the extensive empirical results across three orders of magnitude in model size (0.1B–70B), where the selected (c,r) pairs consistently match or exceed LoGRA quality. In revision we will add a sensitivity study (new subsection in §3) plotting attribution metrics (e.g., Kendall-τ and top-k overlap) versus c and r on a representative 1B model, together with a brief discussion of the observed stability regimes.
  Revision: partial
- Referee: [§4] it is unclear whether the reported quality parity holds after controlling for post-hoc tuning of c and r on the evaluation sets or whether the low-rank structure of projected gradients degrades in particular layers or data regimes; without such controls the 20× speedup claim risks being tied to favorable hyper-parameter choices rather than a general property.
  Authors: The hyperparameters c and r were fixed prior to final evaluation using only a small held-out validation split that does not overlap with any reported test sets; no post-hoc tuning on evaluation data occurred. To further address concerns about layer- or regime-specific degradation, the revised manuscript will include per-layer attribution quality breakdowns and results on stratified data subsets (e.g., by difficulty or domain). These additions will confirm that the low-rank factors preserve fidelity uniformly, supporting that the reported speedups are not artifacts of particular hyper-parameter selections.
  Revision: yes
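The two agreement metrics named in the rebuttal (Kendall-τ and top-k overlap) can be sketched as follows. This is a hedged illustration on synthetic scores: the function names, the noise model standing in for a low-rank approximation, and the choice of the tau-a variant (no tie correction) are assumptions, not details taken from the paper.

```python
import numpy as np

def top_k_overlap(a, b, k):
    """Fraction of the k highest-scoring indices shared by score vectors a and b."""
    top_a = set(np.argsort(-a)[:k].tolist())
    top_b = set(np.argsort(-b)[:k].tolist())
    return len(top_a & top_b) / k

def kendall_tau(a, b):
    """Kendall tau-a by O(n^2) pairwise comparison (no tie correction)."""
    sign_a = np.sign(a[:, None] - a[None, :])
    sign_b = np.sign(b[:, None] - b[None, :])
    n = len(a)
    # Each unordered pair is counted twice, matching the n * (n - 1) denominator.
    return float(np.sum(sign_a * sign_b)) / (n * (n - 1))

rng = np.random.default_rng(0)
full_scores = rng.standard_normal(500)                        # stand-in for exact influence scores
approx_scores = full_scores + 0.1 * rng.standard_normal(500)  # stand-in for an approximation

tau = kendall_tau(full_scores, approx_scores)
overlap = top_k_overlap(full_scores, approx_scores, k=10)
print(f"Kendall tau {tau:.3f}, top-10 overlap {overlap:.2f}")
```

A sensitivity study of the kind the authors promise would repeat such a comparison over a grid of (c, r) values; the pairwise τ above is quadratic in the number of training examples, so it would only be practical on a subsample.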
Circularity Check
No circularity: approximations use external matrix identities
The paper's derivation applies the Woodbury matrix identity and truncated SVD to projected gradients and the Hessian inverse. These are standard, externally verifiable linear-algebra facts independent of the paper's data or fitted parameters. Storage and query reductions follow directly from the rank-c and r-dimensional factorizations without redefining any quantity in terms of its own output. No self-citation chain, ansatz smuggling, or fitted-input-as-prediction pattern appears in the derivation steps. The empirical quality claims are presented as validation rather than definitional tautologies.
Axiom & Free-Parameter Ledger
free parameters (2)
- rank c
- subspace dimension r
axioms (1)
- [standard math] The Woodbury matrix identity holds for the low-rank update form of the Hessian approximation
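For reference, the Woodbury identity and the special case it yields under one plausible reading of the method: a damped Hessian approximation $\lambda I + V_r \Sigma_r^2 V_r^\top$ with orthonormal $V_r$. The damping form follows the LoGRA expression quoted on this page; treating the truncated SVD factors this way is an assumption, not a statement of the paper's exact derivation.

```latex
% General Woodbury identity (A and C invertible):
(A + U C V)^{-1} = A^{-1} - A^{-1} U \left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1}

% Special case: A = \lambda I,\; U = V_r,\; C = \Sigma_r^2,\; V = V_r^\top,\; V_r^\top V_r = I_r
\left(\lambda I + V_r \Sigma_r^2 V_r^\top\right)^{-1}
  = \frac{1}{\lambda} I
    - \frac{1}{\lambda^2} V_r \left(\Sigma_r^{-2} + \frac{1}{\lambda} I\right)^{-1} V_r^\top
  = \frac{1}{\lambda}\left(I - V_r\, \Sigma_r^2 \left(\Sigma_r^2 + \lambda I\right)^{-1} V_r^\top\right)
```

The right-hand side only involves the $D \times r$ matrix $V_r$ and an $r \times r$ diagonal matrix, which is where the $O(Dr)$ memory figure comes from.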
Reference graph
Works this paper leans on
- [1] 1. Sample subsets: Generate $M$ random subsets $\{S_m\}_{m=1}^{M}$ of the training data, each containing a fraction $\alpha$ of the full dataset.
  2. Compute outputs: For each query $x_{\text{query}}$ and subset $S_m$:
     • Actual output: Retrain a model $\theta_m$ on $S_m$ and compute $y^{\text{actual}}_m = f(x_{\text{query}}; \theta_m)$.
     • Predicted output: Sum the attribution scores: $y^{\text{predicted}}_m = \sum_{x^{(i)} \in S_m} \tau(x_{\text{query}}, S)^{(i)}$.
  3. Compute ...
- [2] Stage 1: Gradient computation and storage. Compute and save the (projected) per-example gradients for all training examples. For LoRIF, this includes solving the rank-$c$ factorization via power iteration. For EK-FAC, this stage instead computes and saves the covariances and eigenvalue corrections per layer (see Grosse et al., 2023, for details); EK-FAC doe...
- [3] Stage 2: Inverse Hessian approximation. For LoGRA, form and store $(G^\top G + \lambda I)^{-1}$ per layer. For LoRIF, perform randomized SVD to obtain $V_r$ and $\Sigma_r$, then store them. Tables 5, 6, and 7 report preprocessing times for each model. Runtime depends on many uncontrolled factors and can vary; we report representative measurements. We make the following observations: •...
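Stage 2 above mentions a randomized SVD that yields $V_r$ and $\Sigma_r$. A minimal sketch of one common randomized range-finder recipe is given below; the oversampling amount, the number of power iterations, the function name, and the toy matrix are all assumptions, and the paper's actual procedure may differ.

```python
import numpy as np

def randomized_svd_right(G, r, oversample=10, n_power=2, seed=0):
    """Approximate the top-r right singular vectors and singular values of G (N x D)."""
    rng = np.random.default_rng(seed)
    N, D = G.shape
    k = min(r + oversample, N, D)
    # Sketch the row space of G with a random test matrix.
    Q, _ = np.linalg.qr(G.T @ rng.standard_normal((N, k)))   # D x k
    for _ in range(n_power):
        # Power iterations sharpen the subspace when the spectrum decays slowly.
        Q, _ = np.linalg.qr(G.T @ (G @ Q))
    B = G @ Q                                   # N x k, G restricted to the subspace
    _, s, Wt = np.linalg.svd(B, full_matrices=False)
    V = Q @ Wt.T                                # D x k approximate right singular vectors
    return V[:, :r], s[:r]

# Toy gradient matrix: rank 5 plus small noise.
rng = np.random.default_rng(1)
N, D, r = 300, 80, 5
G = rng.standard_normal((N, r)) @ rng.standard_normal((r, D))
G += 1e-3 * rng.standard_normal((N, D))

V_r, s_r = randomized_svd_right(G, r)
s_exact = np.linalg.svd(G, compute_uv=False)[:r]
print("approximate:", np.round(s_r, 3))
print("exact:      ", np.round(s_exact, 3))
```

Only the D × r matrix and r singular values are kept afterwards, in contrast to the D × D matrix stored in the LoGRA variant of this stage.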
[Figure 14. Qualitative comparison on Apertus-70B (Example 3). Panel text from the extraction (truncated query responses and a "Top-1 Retrieved by LoGRA" training example) is not reproduced here.]