How Faithful Is Trajectory-Based Data Attribution? Error Sources, Remedies, and Practical Guidelines

Hao Lu; Jiachen T. Wang; Jiaqi W. Ma; Junwei Deng; Pingbang Hu; Shichang Zhang; Suliang Jin

arxiv: 2605.18814 · v1 · pith:TSAJL7SVnew · submitted 2026-05-12 · 💻 cs.LG

Junwei Deng, Pingbang Hu, Suliang Jin, Hao Lu, Jiachen T. Wang, Shichang Zhang, Jiaqi W. Ma

Pith reviewed 2026-05-20 23:08 UTC · model grok-4.3

classification 💻 cs.LG

keywords: data attribution, influence estimation, AdamW optimizer, trajectory methods, data selection, error analysis, machine learning

The pith

Accounting for AdamW dynamics makes trajectory-based data attribution match ground-truth influence far more closely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes sources of error in methods that estimate training-sample influence by unrolling a model's training path. It separates those errors into configuration choices, algorithmic approximations, and system factors, and shows that the biggest configuration mismatch arises when a method assumes plain SGD but the model was actually trained with AdamW. The authors introduce AdamW-influence, which follows AdamW's own momentum and second-moment updates, and report that this single change raises Spearman correlation with ground-truth influence by 10% to over 300% across four settings spanning an MLP, a CNN, GPT-2, and Llama 3.2-1B. They also supply a closed-form proxy for the remaining first-order approximation error and show how to unify offline and online data selection inside a short-horizon look-ahead scheme.

Core claim

Trajectory-based attribution is unfaithful primarily because existing formulas assume SGD while modern models use AdamW; deriving influence scores that exactly mirror AdamW's update rules removes this dominant error and produces substantially higher agreement with the ground-truth effect of removing each training point.

What carries the argument

AdamW-influence, the trajectory-unrolling formula that replaces the SGD gradient step with AdamW's bias-corrected first- and second-moment updates inside the influence calculation.
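For orientation, the optimizer step that such a formula has to mirror can be sketched as below. This is the standard AdamW update (decoupled weight decay, bias-corrected moments) written for a plain list of scalar parameters, set next to a plain SGD step; it is not the paper's influence recursion, and the function names, default hyperparameters, and toy gradients are illustrative assumptions.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update on a list of scalar parameters (t is 1-indexed)."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g          # first moment (momentum)
        v_i = beta2 * v_i + (1.0 - beta2) * g * g      # second moment
        m_hat = m_i / (1.0 - beta1 ** t)               # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        p = p - lr * weight_decay * p                  # decoupled weight decay
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive step
        new_theta.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

def sgd_step(theta, grad, lr=1e-3):
    """Plain SGD step, the update that SGD-based attribution assumes."""
    return [p - lr * g for p, g in zip(theta, grad)]

# Toy comparison (illustrative values): two coordinates whose gradients
# differ by four orders of magnitude.
theta0 = [1.0, 1.0]
grads = [100.0, 0.01]
adamw_theta, _, _ = adamw_step(theta0, grads, [0.0, 0.0], [0.0, 0.0],
                               t=1, lr=0.1, weight_decay=0.0)
sgd_theta = sgd_step(theta0, grads, lr=0.1)
print("AdamW:", adamw_theta)
print("SGD:  ", sgd_theta)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so AdamW moves both coordinates by roughly the learning rate, while SGD moves them in proportion to the raw gradient. A per-sample perturbation therefore propagates very differently under the two rules, which is the kind of mismatch the summary above describes.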

Load-bearing premise

The total attribution error decomposes cleanly into independent configuration, algorithm, and system components, with optimizer mismatch as the largest configuration term.

What would settle it

Leave-one-out retraining on the same AdamW-trained models to obtain exact influence values; if Spearman correlation with AdamW-influence scores does not rise markedly above prior methods, the central claim fails.
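The comparison described here reduces to a rank correlation between two score vectors. The sketch below computes Spearman correlation in pure Python (average ranks for ties) and a relative gain between two correlations. Treating the reported percentages as relative gains is an assumption on our part (a gain above 100% cannot be absolute on a quantity bounded by 1), and the score values are made up for illustration.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def relative_gain(rho_new, rho_old):
    """Percentage gain of rho_new over rho_old (assumed reading of the paper's numbers)."""
    return 100.0 * (rho_new - rho_old) / abs(rho_old)

# Hypothetical scores for five training samples.
ground_truth = [0.5, -0.2, 0.1, 0.9, -0.7]   # e.g. from retraining
estimate = [0.4, -0.1, 0.2, 1.5, -0.3]       # same ordering, different scale
print(spearman(ground_truth, estimate))
print(relative_gain(0.8, 0.2))
```

Because only ranks enter the score, an estimator can be badly scaled and still reach a correlation of 1, so this metric tests ordering of samples rather than the magnitude of the estimated effect.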

Figures

Figures reproduced from arXiv: 2605.18814 by Hao Lu, Jiachen T. Wang, Jiaqi W. Ma, Junwei Deng, Pingbang Hu, Shichang Zhang, Suliang Jin.

**Figure 1.** An error taxonomy of trajectory-based attribution and our three key contributions. (a) Three error sources and how their proportion changes along the training trajectory. The absolute error of SGD-influence on MNIST+MLP decomposes into config-level (green, optimizer mismatch), algorithm-level (blue, first-order Taylor), and system-level (gray) components. (b) The proposed AdamW-influence algorithm corrects…

**Figure 2.** Figure 2 reports this decomposition on an MLP trained on MNIST across three learning rates ($10^{-3}$, $10^{-4}$, $10^{-5}$), with errors aggregated into bins of 5 training steps (full setup in Appendix D). Two patterns stand out. (1) Once optimizer mismatch is corrected, the update-estimation error accounts for the overwhelming majority of the remaining error, while the higher-order residual is consistently small. We thus foc…

**Figure 3.** Error norm and intra-step Spearman correlation across three learning rates, on (a, b) an MLP and (c, d) a CNN trained on MNIST. Panels (a, c) report error norm (↓); panels (b, d) report intra-step Spearman correlation (↑, smoothed over 5 consecutive steps). The training step index runs from $0$ to $T - 1$; larger indices correspond to shorter trajectory lengths $|T - t^*|$, since the perturbation has fewer rema…

**Figure 4.** The error proxy tracks the ground-truth error norm $\lVert \Delta\theta - \text{AdamW-influence} \rVert$ across training samples. Each point is a single training sample; axes are on log-log scale. Panels: (a) MLP, $\eta = 10^{-3}$; (b) MLP, $\eta = 10^{-4}$; (c) CNN, $\eta = 10^{-3}$; (d) CNN, $\eta = 10^{-4}$. Both architectures are trained on MNIST. The red dashed line is a linear fit; insets report its slope and $R^2$, along with the Spearman rank correlation $\rho$ bet…

**Figure 5.** Online AdamW-influence consistently outperforms online SGD-influence, and matches or exceeds offline AdamW-influence across four settings: (a) an MLP and (b) a CNN trained on MNIST (error rate, ↓), and Llama 3.2-1B fine-tuned on (c) Tulu3 and (d) Alpaca (perplexity, ↓). For MNIST settings, “Val” is at the best epoch and “Test” is reported at that epoch; for Llama 3.2-1B settings, both are at the final st…

**Figure 6.** The optimal $K$ increases as the learning rate decreases. We sweep $K \in \{2, 5, 10, 25\}$ across three learning rates using online AdamW-influence for MLP on MNIST. Bars show test error rate at the best validation epoch; the best $K$ per learning rate is highlighted in red. Optimizer alignment transfers to downstream selection. Online AdamW-influence outperforms online SGD-influence in all settings (…

Original abstract

Trajectory-based data attribution methods estimate the influence of training samples on model predictions by unrolling the training trajectory. They are widely used in applications such as data selection, data valuation, and model diagnosis, but there is a lack of comprehensive error analysis of these methods, raising concerns about method faithfulness and hindering reliable deployment. In this work, we provide the first systematic analysis of error sources in trajectory-based data attribution, together with concrete remedies to mitigate them and practical guidelines for downstream use. We organize the total error into three categories, config-level, algorithm-level, and system-level. We make three contributions. First, we identify optimizer mismatch as the dominant config-level error: existing methods derive their attribution under the assumption of SGD, even for models trained with the modern de facto optimizer AdamW. We propose AdamW-influence to fully account for AdamW's optimization dynamics, yielding improvements from 10% to over 300% in Spearman correlation between estimated and ground-truth influence across four settings spanning MLP, CNN, GPT-2, and Llama 3.2-1B. Second, we isolate the remaining algorithm-level error arising from the first-order Taylor approximation, identify the learning rate and trajectory length as factors governing the error magnitude, and derive a closed-form error proxy that can be evaluated along the original trajectory without retraining. Third, we translate these insights into practical guidelines for data selection by unifying offline and online strategies under a K-step look-ahead framework. Under this framework, online selection with a short horizon often matches or exceeds offline, and the optimal horizon can be tuned jointly with the learning rate. Together, these results turn the framework into an actionable selection recipe for practitioners.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper usefully flags optimizer mismatch as the main error in trajectory attribution and supplies fixes that improve correlations substantially, though the error decomposition may not be as separable as claimed.

The main thing to know is that this paper identifies optimizer mismatch as the largest source of error in trajectory-based data attribution when models are trained with AdamW rather than SGD, and that its AdamW-influence fix improves Spearman correlation by 10% to over 300% in tests on an MLP, a CNN, GPT-2, and Llama 3.2-1B. The authors organize the total error into config-level, algorithm-level, and system-level pieces and give concrete remedies plus guidelines for data selection.

The closed-form error proxy for the Taylor approximation stands out because it runs along the existing trajectory without retraining. The K-step look-ahead unification that connects offline and online selection is also practical: it shows that short-horizon online selection often matches or beats full offline selection, and that the horizon can be tuned jointly with the learning rate. These pieces turn the method into something more reliable for practitioners doing data selection or diagnosis.

The soft spot is the assumption that the three error categories are cleanly separable. The stress-test note is on point here: the trajectory itself is produced by AdamW, so any proxy derived under SGD assumptions will be evaluated on a different path once the optimizer is corrected. The paper treats the categories as independent with separate fixes, but that coupling could make the decomposition less clean than presented. The abstract reports solid correlation gains, yet the full derivations would need to confirm that the proxy still holds after the dynamics change.

This is aimed at ML researchers and engineers who use influence estimates for data valuation or pruning with modern optimizers. A reader who needs actionable fixes for non-SGD settings will get direct value. It has enough new formulations and empirical checks to deserve a serious referee, even if the independence claim needs tightening in review. I would recommend sending it out for peer review.

Referee Report

2 major / 2 minor

Summary. The paper provides the first systematic analysis of error sources in trajectory-based data attribution methods, organizing total error into config-level (primarily optimizer mismatch between SGD assumptions and AdamW training), algorithm-level (first-order Taylor approximation), and system-level categories. It proposes AdamW-influence to correct for AdamW dynamics, yielding Spearman correlation improvements from 10% to over 300% across MLP, CNN, GPT-2, and Llama 3.2-1B settings; derives a closed-form error proxy for the Taylor approximation governed by learning rate and trajectory length; and unifies offline/online data selection under a K-step look-ahead framework with practical guidelines.

Significance. If the central claims hold, this work meaningfully advances the reliability of trajectory-based attribution for data selection, valuation, and diagnosis by quantifying and mitigating dominant error sources in modern training regimes. The empirical gains with AdamW-influence, the retraining-free error proxy, and the K-step unification provide both theoretical insight and actionable recipes; the cross-model evaluation spanning small MLPs to 1B-scale LLMs strengthens generalizability.

major comments (2)

[§3] §3 (error decomposition): The organization of total error into independent config-level, algorithm-level, and system-level components treats these as separable so that AdamW-influence can be applied in isolation, but the training trajectory is itself generated by AdamW; any first-order Taylor proxy derived under SGD dynamics is therefore evaluated on a different path, creating potential coupling that is not analyzed when proposing independent remedies.
[§4.2] §4.2 (closed-form error proxy): The claim that the proxy can be evaluated along the original trajectory without retraining relies on the first-order approximation remaining valid after the AdamW correction; if the optimizer change alters higher-order terms, the proxy's accuracy and the identified governing factors (learning rate, trajectory length) require explicit validation on AdamW trajectories.

minor comments (2)

[§5] Experimental details on how ground-truth influence is computed (e.g., exact leave-one-out or retraining protocol) are referenced but not fully specified in the main text, making it difficult to assess whether the reported Spearman gains are robust to alternative ground-truth definitions.
[Figures 3-5] Figure captions and axis labels in the correlation plots should explicitly state the number of runs and seeds used to compute the reported improvements, to clarify statistical reliability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our error decomposition and the closed-form proxy. We address each major comment below and outline the changes we will make.

Referee: [§3] §3 (error decomposition): The organization of total error into independent config-level, algorithm-level, and system-level components treats these as separable so that AdamW-influence can be applied in isolation, but the training trajectory is itself generated by AdamW; any first-order Taylor proxy derived under SGD dynamics is therefore evaluated on a different path, creating potential coupling that is not analyzed when proposing independent remedies.

Authors: We agree that the components are coupled through the shared AdamW-generated trajectory and that a fully independent treatment is an approximation. Our analysis isolates the dominant config-level mismatch (optimizer mismatch), which AdamW-influence corrects directly on the observed trajectory; the algorithm-level Taylor proxy is then applied to the corrected estimates. While we did not derive a joint higher-order expansion of the coupled errors, the large empirical gains (10% to over 300% in Spearman correlation) indicate that addressing the primary mismatch yields reliable improvements in practice. We will add an explicit discussion of this coupling and its limitations in the revised manuscript.
Revision: partial

Referee: [§4.2] §4.2 (closed-form error proxy): The claim that the proxy can be evaluated along the original trajectory without retraining relies on the first-order approximation remaining valid after the AdamW correction; if the optimizer change alters higher-order terms, the proxy's accuracy and the identified governing factors (learning rate, trajectory length) require explicit validation on AdamW trajectories.

Authors: The error proxy is obtained from a first-order Taylor expansion whose leading terms depend on the learning rate and trajectory length; these quantities are directly observable along the original (AdamW) trajectory. Because AdamW-influence already aligns the influence function with the actual optimizer, the proxy is meant to be evaluated after this correction. To confirm that higher-order terms do not materially affect the proxy's accuracy or the identified governing factors, we will add explicit validation experiments that compare proxy predictions against measured errors on AdamW-trained trajectories in the revision.
Revision: yes
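One way such a proxy-versus-measured comparison could be scored is the log-log linear fit with slope and $R^2$ that the Figure 4 caption mentions. The sketch below is a generic least-squares fit in pure Python, not the authors' evaluation code; `loglog_fit` and the synthetic data are hypothetical.

```python
import math

def loglog_fit(proxy, measured):
    """Fit log(measured) = slope * log(proxy) + intercept; return slope, intercept, R^2."""
    xs = [math.log(p) for p in proxy]
    ys = [math.log(m) for m in measured]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic check: measured error follows an exact power law of the proxy.
proxy_values = [1.0, 2.0, 4.0, 8.0]
measured_errors = [2.0 * p ** 1.5 for p in proxy_values]
slope, intercept, r2 = loglog_fit(proxy_values, measured_errors)
print(slope, math.exp(intercept), r2)
```

A slope near 1 with high $R^2$ would indicate that the proxy tracks the measured error up to a constant factor; a high rank correlation with a slope far from 1 would indicate that it orders samples correctly but misjudges how the error scales.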

Circularity Check

0 steps flagged

Derivation chain is self-contained; no reduction to inputs by construction

The paper organizes error into config-, algorithm-, and system-level categories as an analytical framework, proposes AdamW-influence to correct optimizer mismatch, and derives a closed-form proxy for first-order Taylor error that is evaluated along the existing trajectory without retraining. These steps are validated via external Spearman correlations to ground-truth influence on held-out models (MLP through Llama 3.2-1B), not by fitting parameters that are then renamed as predictions. Self-citations to prior trajectory-based attribution work supply background but are not invoked as uniqueness theorems or load-bearing premises that force the new results; the K-step look-ahead unification follows directly from the identified factors (learning rate, horizon) and remains testable independently. No equation or claim reduces to its own inputs by definition or statistical necessity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis assumes that influence can be meaningfully estimated via unrolled trajectories and that ground-truth influence exists as a measurable quantity for validation. The Taylor approximation error proxy relies on the validity of the first-order expansion along the training path. No new physical entities are introduced.
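To make these two assumptions concrete, the toy below compares a retraining-based ground truth with a first-order unrolled estimate. It uses full-batch gradient descent on a one-parameter logistic model, so it corresponds to the SGD-style baseline rather than AdamW-influence, and every name, data point, and hyperparameter in it is an illustrative assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad(theta, x, y):
    # Derivative of the logistic loss log(1 + exp(-y * x * theta)) w.r.t. theta.
    return -y * x * sigmoid(-y * x * theta)

def hess(theta, x, y):
    # Second derivative of the same loss (y is +1 or -1, so y * y == 1).
    s = sigmoid(y * x * theta)
    return x * x * s * (1.0 - s)

def train(data, lr, steps, drop=None):
    """Full-batch gradient descent from theta = 0; returns theta_0 .. theta_T.
    drop=(t_star, j) omits sample j's gradient at step t_star only."""
    n = len(data)
    traj = [0.0]
    for t in range(steps):
        theta = traj[-1]
        total = sum(grad(theta, x, y)
                    for i, (x, y) in enumerate(data) if drop != (t, i))
        traj.append(theta - lr * total / n)
    return traj

def unrolled_estimate(data, traj, lr, t_star, j):
    """First-order estimate of the final parameter shift, using only the
    original trajectory (no retraining)."""
    n = len(data)
    x_j, y_j = data[j]
    delta = lr * grad(traj[t_star], x_j, y_j) / n   # shift right after step t_star
    for t in range(t_star + 1, len(traj) - 1):
        h = sum(hess(traj[t], x, y) for x, y in data) / n
        delta *= 1.0 - lr * h                       # linearized propagation
    return delta

data = [(1.0, 1), (-0.5, 1), (2.0, -1), (0.3, 1), (-1.2, -1), (0.8, 1)]
lr, steps, t_star, j = 0.1, 20, 5, 2
base = train(data, lr, steps)
true_delta = train(data, lr, steps, drop=(t_star, j))[-1] - base[-1]
est_delta = unrolled_estimate(data, base, lr, t_star, j)
print(true_delta, est_delta, abs(true_delta - est_delta))
```

The gap between the two numbers is the algorithm-level (first-order Taylor) error for this toy. Rerunning with a different `lr` or `t_star` shows how that gap responds to the learning rate and to the number of remaining steps, the two factors the ledger names.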

axioms (2)

domain assumption: Training dynamics can be approximated by unrolling the optimization trajectory under the chosen optimizer.
Invoked when defining trajectory-based attribution and when proposing AdamW-influence to match actual training.

domain assumption: The first-order Taylor approximation error is governed primarily by learning rate and trajectory length.
Used to derive the closed-form error proxy.

pith-pipeline@v0.9.0 · 5863 in / 1532 out tokens · 37163 ms · 2026-05-20T23:08:22.241438+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 3 internal anchors

[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[2] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[3] Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296.

[4] Kelvin Guu, Albert Webson, Ellie Pavlick, Lucas Dixon, Ian Tenney, and Tolga Bolukbasi. Simfluence: Modeling the influence of individual training examples by simulating training runs. arXiv preprint arXiv:2303.08114.

[5] Andrew Ilyas and Logan Engstrom. Magic: Near-optimal data attribution for deep learning. arXiv preprint arXiv:2504.16430.

[6] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[7] Yunxiao Shi, Shuo Yang, Yixin Su, Rui Zhang, and Min Xu. Accumulative SGD influence estimation for data attribution. arXiv preprint arXiv:2510.26185.

[8] Shichang Zhang, Hongzhe Du, Jiaqi W Ma, and Himabindu Lakkaraju. Who gets credit or blame? Attributing accountability in modern AI systems. arXiv preprint arXiv:2506.00175.

[9] Excerpt from the paper (Deng et al.), Appendix B: diag($g_t$). (11b) $M_t$ collects the Hessian-free terms (momentum decay and weight decay), while $R_t$ collects the Hessian-mediated coupling between optimizer states. B.4 Backward recurrence for the summary matrix. Analogous to SGD-influence, we ...

[10] Excerpt from the paper, Appendix D (Experiment details), D.1 Experimental settings for fidelity evaluation: In section 3.2, we consider four experiment settings to evaluate the attribution fidelity of AdamW-influence. We also consider additional settings presented in Appendix E. MNIST+MLP. We train a three-layer multilayer perceptron (MLP) with hidden layer sizes of 16, consisting of two...

[11] Excerpt from the paper: on the Alpaca instruction-tuning dataset (Taori et al., 2023), using the first 512 training examples with a maximum sequence length of 512 tokens. We train for 1 epoch and evaluate attribution fidelity on 100 test samples from the SAMSum summarization task (Gliwa et al., 2019). We use the AdamW optimizer and sweep learning rates spanning $5\times10^{-7}$, $2\times10^{-6}$, ...

[12] Excerpt from the paper: The first term originates from the loss nonlinearity ($\nabla^3 \ell$) entering $\ddot{\hat{m}}_{t,i}$ and is present even for non-adaptive optimizers; the second originates from the quadratic dependence of $\hat{v}_t$ on gradient perturbations and is specific to Adam-family optimizers. Both belong to a single second-order remainder of $f_{t,i}$ and are not independent error sources. ...
