Data Cleansing for Models Trained with SGD
Pith reviewed 2026-05-25 19:33 UTC · model grok-4.3
The pith
Retracing SGD steps identifies influential training instances for data cleansing without convexity or optimality assumptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that influential instances can be inferred by retracing the SGD optimization trajectory while incorporating the intermediate models computed in each step, enabling effective data cleansing for models trained with non-convex losses and without reaching an optimum.
What carries the argument
The retracing procedure that follows each SGD update and uses the sequence of intermediate models to compute influence scores.
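As a rough illustration of what retracing can look like, the sketch below assumes a one-parameter least-squares model, a fixed batch schedule, and a counterfactual run that skips one instance's gradient while keeping the batch denominator unchanged. None of these choices, nor the function names, come from the paper; they are assumptions picked so the idea fits in a few lines. The stored intermediate models w_t are replayed to estimate how far the final parameter would move if instance j had been left out.

```python
import random

def grad(w, x, y):
    """Gradient of the squared loss 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def sgd(data, batches, lr, skip=None):
    """Run mini-batch SGD from w=0 and return every intermediate model w_t.

    If `skip` is an index, that instance's gradient is dropped from each
    batch it appears in (assumption: the batch denominator stays the same).
    """
    w, path = 0.0, [0.0]
    for batch in batches:
        g = sum(grad(w, *data[i]) for i in batch if i != skip)
        w -= lr * g / len(batch)
        path.append(w)
    return path

def retrace_influence(data, batches, lr, path, j):
    """Estimate (w without j) - (w with j) by replaying the stored path."""
    delta = 0.0
    for t, batch in enumerate(batches):
        w_t = path[t]  # intermediate model saved during training
        # Curvature of the batch loss without j; x**2 is the per-instance
        # second derivative of the squared loss.
        h = sum(data[i][0] ** 2 for i in batch if i != j)
        delta = (1.0 - lr * h / len(batch)) * delta
        if j in batch:
            # Instance j's gradient is the push that the counterfactual run lacks.
            delta += lr * grad(w_t, *data[j]) / len(batch)
    return delta

# Hypothetical toy data: y = 2x plus noise, with one corrupted label.
rng = random.Random(0)
data = []
for _ in range(20):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))
data[3] = (data[3][0], -5.0)

# Three epochs of shuffled mini-batches of size 4.
batches = []
for _ in range(3):
    order = list(range(len(data)))
    rng.shuffle(order)
    batches += [order[k:k + 4] for k in range(0, len(order), 4)]

LR = 0.1
path = sgd(data, batches, LR)
estimate = retrace_influence(data, batches, LR, path, j=3)
actual = sgd(data, batches, LR, skip=3)[-1] - path[-1]
print("retraced estimate:", estimate)
print("actual retraining:", actual)
```

For this quadratic loss the replayed estimate matches actual retraining up to rounding. For a non-convex loss the same recursion would only be a first-order approximation around each stored model.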
If this is right
- Models trained with SGD can be improved simply by inspecting and removing the instances the algorithm flags.
- The method applies directly to non-convex models common in deep learning.
- Data cleansing becomes feasible for users without specialized domain knowledge.
- Experiments on image datasets confirm that the flagged instances are genuinely influential.
Where Pith is reading between the lines
- The same retracing idea could be tested on other first-order optimizers such as Adam.
- The approach might surface systematic biases in a dataset by repeatedly identifying the same examples as influential.
- For very large datasets the method could be paired with cheaper approximations to keep computation feasible.
Load-bearing premise
That retracing the SGD path with intermediate models correctly identifies which instances are influential, even for non-convex losses and without an optimal model.
What would settle it
Removing the instances flagged by the method yields no improvement, or a decline, in validation accuracy compared with removing the same number of randomly chosen instances.
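The check described above can be sketched as a small harness. This is a minimal sketch under stated assumptions: `fit` and `val_error` are hypothetical stand-ins (a closed-form least-squares slope and mean squared error) for whatever training and validation routine is actually in use, and validation error is used in place of accuracy because the toy task is regression.

```python
import random

def fit(train):
    """Least-squares slope for y = w*x (hypothetical stand-in for training)."""
    return sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def val_error(w, val):
    """Mean squared error on held-out data (stand-in for validation accuracy)."""
    return sum((w * x - y) ** 2 for x, y in val) / len(val)

def error_after_removal(train, val, removed):
    """Retrain without the instances whose indices are in `removed`."""
    kept = [z for i, z in enumerate(train) if i not in removed]
    return val_error(fit(kept), val)

def compare_with_random(train, val, flagged, trials=20, seed=0):
    """Return (baseline error, error after flagged removal,
    mean error after removing equally many random instances)."""
    rng = random.Random(seed)
    base = val_error(fit(train), val)
    flagged_err = error_after_removal(train, val, set(flagged))
    random_errs = []
    for _ in range(trials):
        picked = set(rng.sample(range(len(train)), len(flagged)))
        random_errs.append(error_after_removal(train, val, picked))
    return base, flagged_err, sum(random_errs) / trials

# Hypothetical toy data: y = 2x, with two labels flipped in sign.
xs = [i / 10 for i in range(1, 21)]
train = [(x, 2 * x) for x in xs]
for i in (4, 17):
    train[i] = (xs[i], -2 * xs[i])
val = [(x, 2 * x) for x in (0.05, 0.55, 1.05, 1.55)]

base, flagged_err, random_err = compare_with_random(train, val, flagged=[4, 17])
print("baseline:", base)
print("flagged removed:", flagged_err)
print("random removed (mean):", random_err)
```

If the flagged-removal error were no lower than the random-removal mean on real data, that would count against the method's central claim.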
Original abstract
Data cleansing is a typical approach used to improve the accuracy of machine learning models, which, however, requires extensive domain knowledge to identify the influential instances that affect the models. In this paper, we propose an algorithm that can suggest influential instances without using any domain knowledge. With the proposed method, users only need to inspect the instances suggested by the algorithm, implying that users do not need extensive knowledge for this procedure, which enables even non-experts to conduct data cleansing and improve the model. The existing methods require the loss function to be convex and an optimal model to be obtained, which is not always the case in modern machine learning. To overcome these limitations, we propose a novel approach specifically designed for the models trained with stochastic gradient descent (SGD). The proposed method infers the influential instances by retracing the steps of the SGD while incorporating intermediate models computed in each step. Through experiments, we demonstrate that the proposed method can accurately infer the influential instances. Moreover, we used MNIST and CIFAR10 to show that the models can be effectively improved by removing the influential instances suggested by the proposed method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an algorithm to identify influential training instances for data cleansing in models trained via SGD. The method retraces the SGD trajectory by incorporating intermediate models at each step, avoiding requirements for convex losses or globally optimal models. The experiments are reported to show accurate inference of influential instances, and removing the suggested instances is reported to improve accuracy on MNIST and CIFAR-10.
Significance. If the empirical results hold under non-convex training, the heuristic offers a practical, domain-knowledge-free approach to data cleansing for modern deep learning pipelines where influence functions are inapplicable. It could enable non-experts to debug training data and improve models without convexity assumptions.
minor comments (3)
- [Experiments] The abstract states that the experiments demonstrate 'accurate inference' and model improvement, but substantiating this central empirical claim requires quantitative metrics (e.g., precision of the identified instances and comparisons against baselines such as random removal or standard influence functions), the dataset splits, and the exact evaluation protocol.
- [Method] Clarify the precise procedure for storing and reusing intermediate models during retracing, including any approximations used to control memory and compute cost when the number of SGD steps is large.
- Notation: ensure consistent use of symbols for the sequence of intermediate parameters and the influence score definition across equations and pseudocode.
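To make the memory concern in the [Method] comment concrete, here is a minimal sketch of one common way to trade compute for storage: keep only every k-th intermediate model and replay a few SGD steps to recover the rest. This is an assumption for illustration, not something the paper is stated to do; the toy model, the function names, and the `every` parameter are all hypothetical.

```python
def sgd_step(w, batch, data, lr):
    """One mini-batch SGD step on the squared loss 0.5 * (w*x - y)**2."""
    g = sum((w * x - y) * x for x, y in (data[i] for i in batch))
    return w - lr * g / len(batch)

def train_with_checkpoints(data, batches, lr, every):
    """Run SGD from w=0, storing only w_0 and every `every`-th model."""
    w, checkpoints = 0.0, {0: 0.0}
    for t, batch in enumerate(batches, start=1):
        w = sgd_step(w, batch, data, lr)
        if t % every == 0:
            checkpoints[t] = w
    return w, checkpoints

def model_at(t, checkpoints, data, batches, lr, every):
    """Recover w_t by replaying at most `every - 1` steps from a checkpoint."""
    start = (t // every) * every  # nearest stored step at or before t
    w = checkpoints[start]
    for batch in batches[start:t]:
        w = sgd_step(w, batch, data, lr)
    return w

# Hypothetical toy setup: 8 points on y = 2x, 10 fixed mini-batches.
data = [(i / 8, 2 * i / 8) for i in range(1, 9)]
batches = [[i % 8, (i + 3) % 8] for i in range(10)]
LR, EVERY = 0.1, 4

final_w, checkpoints = train_with_checkpoints(data, batches, LR, EVERY)
print("stored steps:", sorted(checkpoints))
print("w_6 recovered by replay:", model_at(6, checkpoints, data, batches, LR, EVERY))
```

The replay is exact only because the batch schedule is stored and the updates are deterministic; with nondeterministic kernels or data augmentation the recovered models could drift from the ones seen during training.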
Simulated Author's Rebuttal
We thank the referee for the positive summary of our work and the recommendation for minor revision. The referee correctly notes that our method avoids convexity assumptions by retracing the SGD trajectory with intermediate models, which is the core contribution for practical data cleansing in modern non-convex settings.
Circularity Check
No significant circularity
The paper's core contribution is an explicit algorithmic procedure that retraces SGD steps using the sequence of intermediate models produced during training. This construction is presented as a direct heuristic extension of SGD mechanics to the non-convex, non-optimal setting; it does not rely on fitting parameters to a target quantity and then relabeling the fit as a prediction, nor on any self-citation chain that would render the central claim tautological. The experimental claims (accurate identification on MNIST/CIFAR-10 and accuracy gains after removal) are external empirical checks rather than internal reductions. No load-bearing step reduces by definition or by self-reference to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Influential instances can be identified by retracing the SGD trajectory using intermediate models.