
arxiv: 1906.08473 · v1 · pith:TIBGC4SKnew · submitted 2019-06-20 · 📊 stat.ML · cs.LG

Data Cleansing for Models Trained with SGD

Pith reviewed 2026-05-25 19:33 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords: data cleansing · influential instances · stochastic gradient descent · machine learning · model improvement · MNIST · CIFAR10

The pith

Retracing SGD steps identifies influential training instances for data cleansing without convexity or optimality assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an algorithm that suggests which training instances most affect a model's predictions when the model is trained with stochastic gradient descent. It follows the sequence of model updates produced during SGD and uses the intermediate models saved at each step to score each instance's influence. Users therefore need no domain expertise to find candidate instances, and, unlike earlier methods, the approach requires neither a convex loss function nor an optimal model. Experiments show that the method accurately recovers influential instances and that deleting them raises accuracy on MNIST and CIFAR10.

Core claim

The paper claims that influential instances can be inferred by retracing the SGD optimization trajectory while incorporating the intermediate models computed in each step, enabling effective data cleansing for models trained with non-convex losses and without reaching an optimum.
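This page does not reproduce the paper's equations, so the following is only one way the claim could be formalized. All notation is assumed for illustration: $z_i$ are training instances, $S_t$ is the mini-batch at step $t$, $\eta_t$ is the learning rate, $g(z;\theta) = \nabla_\theta \ell(z;\theta)$, $H^{[t]}$ is the mini-batch Hessian at $\theta^{[t]}$, and $u$ is a query vector such as the gradient of a validation loss.

```latex
% Sketch under assumed notation; not copied from the paper.
\begin{align}
\theta^{[t+1]} &= \theta^{[t]} - \frac{\eta_t}{|S_t|} \sum_{i \in S_t} g(z_i; \theta^{[t]})
  && \text{(SGD as actually run)} \\
\theta_{-j}^{[t+1]} &= \theta_{-j}^{[t]} - \frac{\eta_t}{|S_t|} \sum_{i \in S_t \setminus \{j\}} g(z_i; \theta_{-j}^{[t]})
  && \text{(the same run with } z_j \text{ absent)} \\
\Delta_j^{[t+1]} &\approx \bigl(I - \eta_t H^{[t]}\bigr)\, \Delta_j^{[t]}
  + \mathbb{1}[j \in S_t]\, \frac{\eta_t}{|S_t|}\, g(z_j; \theta^{[t]}),
  \qquad \Delta_j^{[0]} = 0 \\
\text{influence of } z_j \text{ on } u &\approx \bigl\langle u,\ \Delta_j^{[T]} \bigr\rangle,
  \qquad \Delta_j^{[t]} := \theta_{-j}^{[t]} - \theta^{[t]}
\end{align}
```

The third line comes from subtracting the first two updates and linearizing $g(z_i;\theta_{-j}^{[t]}) - g(z_i;\theta^{[t]}) \approx \nabla^2_\theta \ell(z_i;\theta^{[t]})\,\Delta_j^{[t]}$. Nothing in it requires $\ell$ to be convex or $\theta^{[T]}$ to be a minimizer, which is consistent with the claim above, though the quality of the linearization in practice is an empirical question.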

What carries the argument

The retracing procedure that follows each SGD update and uses the sequence of intermediate models to compute influence scores.
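To make the retracing idea concrete, here is a minimal sketch on a toy one-parameter model with loss 0.5 * (theta * x - y)**2. It is an illustration under stated assumptions, not the authors' implementation: the function names (`sgd`, `retrace`, `make_batches`), the toy data, and the learning rate are all hypothetical, and a real network would presumably replace the scalar second derivative with Hessian-vector products.

```python
import random

def grad(theta, x, y):
    """Gradient of the toy loss 0.5 * (theta * x - y)**2 with respect to theta."""
    return x * (theta * x - y)

def hess(x):
    """Second derivative of the same loss; for this toy model it is independent of theta."""
    return x * x

def make_batches(n, batch_size, epochs, rng):
    """Shuffle the indices every epoch and cut them into mini-batches."""
    batches = []
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)
        batches += [order[k:k + batch_size] for k in range(0, n, batch_size)]
    return batches

def sgd(data, batches, lr, skip=None):
    """Run SGD and record the parameter seen at the start of every step.

    With skip=j, instance j contributes nothing (the counterfactual run).
    """
    theta, trajectory = 0.0, []
    for batch in batches:
        trajectory.append(theta)
        step = sum(grad(theta, *data[i]) for i in batch if i != skip)
        theta -= lr / len(batch) * step
    return theta, trajectory

def retrace(data, batches, trajectory, lr, u=1.0):
    """Walk the stored steps backwards; estimate u * (theta_without_j - theta) for every j."""
    scores = [0.0] * len(data)
    for batch, theta in zip(reversed(batches), reversed(trajectory)):
        for j in batch:
            scores[j] += u * lr / len(batch) * grad(theta, *data[j])
        mean_hess = sum(hess(data[i][0]) for i in batch) / len(batch)
        u -= lr * mean_hess * u  # propagate the query one step further back
    return scores

rng = random.Random(0)
data = [(0.0, 1.0)]  # x = 0, so this instance never produces a gradient
for _ in range(19):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))
data[1] = (1.0, -2.0)  # deliberately corrupted label (hypothetical noise)

batches = make_batches(len(data), batch_size=5, epochs=10, rng=rng)
theta, trajectory = sgd(data, batches, lr=0.1)
scores = retrace(data, batches, trajectory, lr=0.1)

# Compare the estimate with an actual counterfactual retraining for a few instances.
for j in (0, 1, 2):
    actual = sgd(data, batches, lr=0.1, skip=j)[0] - theta
    print(j, round(scores[j], 4), round(actual, 4))
```

The backward loop touches each stored step once, so scores for all instances come out of a single pass, whereas the brute-force check at the end retrains once per instance. The estimate is approximate even for this quadratic toy loss, because the mini-batch Hessian used in the backward pass still includes the removed instance.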

If this is right

  • Models trained with SGD can be improved simply by inspecting and removing the instances the algorithm flags.
  • The method applies directly to non-convex models common in deep learning.
  • Data cleansing becomes feasible for users without specialized domain knowledge.
  • Experiments on image datasets confirm that the flagged instances are genuinely influential.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retracing idea could be tested on other first-order optimizers such as Adam.
  • The approach might surface systematic biases in a dataset by repeatedly identifying the same examples as influential.
  • For very large datasets the method could be paired with cheaper approximations to keep computation feasible.

Load-bearing premise

That retracing the SGD path with intermediate models correctly identifies which instances are influential, even for non-convex losses and without an optimal model.

What would settle it

The claim would fail if removing the instances flagged by the method improved validation accuracy no more than removing the same number of randomly chosen instances, or if it lowered accuracy.
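A check of this kind can be sketched as a small harness. Everything here is hypothetical: `train_eval` stands for whatever routine retrains on a subset of indices and returns validation accuracy, and the toy stand-in below exists only so that the sketch runs.

```python
import random

def cleansing_check(train_eval, n, flagged, trials=20, seed=0):
    """Compare removing `flagged` against removing equally many random instances.

    train_eval(kept) is a hypothetical callback: train on the index list `kept`
    and return validation accuracy. Returns (gain from flagged removal,
    mean gain from random removal), both relative to training on everything.
    """
    rng = random.Random(seed)
    everything = list(range(n))
    baseline = train_eval(everything)
    flagged_set = set(flagged)
    flagged_gain = train_eval([i for i in everything if i not in flagged_set]) - baseline
    random_gains = []
    for _ in range(trials):
        drop = set(rng.sample(everything, len(flagged_set)))
        random_gains.append(train_eval([i for i in everything if i not in drop]) - baseline)
    return flagged_gain, sum(random_gains) / trials

# Toy stand-in for training: accuracy drops by 0.1 for every harmful instance kept.
bad = {0, 1, 2}

def toy_train_eval(kept):
    return 1.0 - sum(i in bad for i in kept) / 10

flagged_gain, random_gain = cleansing_check(toy_train_eval, n=10, flagged=[0, 1, 2])
print(flagged_gain, random_gain)
```

If `flagged_gain` does not clearly exceed `random_gain` across repeated runs, the flagged instances are doing no better than chance, which is the outcome described above.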

Figures

Figures reproduced from arXiv: 1906.08473 by Atsushi Nitanda, Satoshi Hara, Takanori Maehara.

Figure 1: Estimated linear influences for linear logistic regression (LogReg) and deep neural networks (DNN)
Figure 2: Average misclassification rates on the test set after data cleansing. The error bars are omitted for …
Figure 3: Examples of found influential instances and their labels in (a)(b) MNIST and (c)(d) CIFAR10.
Figure 4: Structures of convolutional neural networks (CNNs)
Figure 5: Structures of Autoencoders
Figure 6: MNIST: Average misclassification rates on the test set after data cleansing over 30 experiments.
Figure 7: Exhaustive results on MNIST: [Thick lines] Average misclassification rates on the test set after …
Figure 8: CIFAR10: Average misclassification rates on the test set after data cleansing over 30 experiments.
Figure 9: Exhaustive results on CIFAR10: [Thick lines] Average misclassification rates on the test set after …
Figure 10: Comparison of the misclassification rates before and after the data cleansing with the proposed …
Figure 11: Examples of found top-20 influential instances in MNIST
Figure 12: Examples of found top-20 influential instances in CIFAR10

Original abstract

Data cleansing is a typical approach used to improve the accuracy of machine learning models, which, however, requires extensive domain knowledge to identify the influential instances that affect the models. In this paper, we propose an algorithm that can suggest influential instances without using any domain knowledge. With the proposed method, users only need to inspect the instances suggested by the algorithm, implying that users do not need extensive knowledge for this procedure, which enables even non-experts to conduct data cleansing and improve the model. The existing methods require the loss function to be convex and an optimal model to be obtained, which is not always the case in modern machine learning. To overcome these limitations, we propose a novel approach specifically designed for the models trained with stochastic gradient descent (SGD). The proposed method infers the influential instances by retracing the steps of the SGD while incorporating intermediate models computed in each step. Through experiments, we demonstrate that the proposed method can accurately infer the influential instances. Moreover, we used MNIST and CIFAR10 to show that the models can be effectively improved by removing the influential instances suggested by the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes an algorithm to identify influential training instances for data cleansing in models trained via SGD. The method retraces the SGD trajectory by incorporating intermediate models at each step, avoiding requirements for convex losses or globally optimal models. Experiments claim to show accurate inference of influential instances, with removal of suggested instances improving accuracy on MNIST and CIFAR10.

Significance. If the empirical results hold under non-convex training, the heuristic offers a practical, domain-knowledge-free approach to data cleansing for modern deep learning pipelines where influence functions are inapplicable. It could enable non-experts to debug training data and improve models without convexity assumptions.

minor comments (3)
  1. [Experiments] Experiments section: the abstract states that experiments demonstrate 'accurate inference' and model improvement, but quantitative metrics (e.g., precision of identified instances, comparison to baselines such as random removal or standard influence functions), dataset splits, and exact evaluation protocol are needed to substantiate the central empirical claim.
  2. [Method] Method description: clarify the precise procedure for storing and reusing intermediate models during retracing, including any approximations used to control memory and compute cost when the number of SGD steps is large.
  3. Notation: ensure consistent use of symbols for the sequence of intermediate parameters and the influence score definition across equations and pseudocode.
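The memory concern in comment 2 can be made concrete with back-of-envelope arithmetic. The sketch below assumes every SGD step stores a full copy of the parameters. The batch size, epoch count, and parameter count are hypothetical, not the paper's settings; only the 60,000 figure (the MNIST training set size) is a known fact.

```python
import math

def trajectory_bytes(num_examples, batch_size, epochs, num_params, bytes_per_param=4):
    """Return (number of SGD steps, bytes needed to store one parameter copy per step)."""
    steps = math.ceil(num_examples / batch_size) * epochs
    return steps, steps * num_params * bytes_per_param

# Hypothetical setting: MNIST-sized data, batch size 64, 20 epochs,
# a one-million-parameter model stored as 32-bit floats.
steps, total = trajectory_bytes(60_000, 64, 20, 1_000_000)
print(steps, total / 1e9, "GB")
```

Under these assumed numbers the stored trajectory is roughly 75 GB, which is why the referee's request for details on storage and approximations matters for larger models.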

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. The referee correctly notes that our method avoids convexity assumptions by retracing the SGD trajectory with intermediate models, which is the core contribution for practical data cleansing in modern non-convex settings.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core contribution is an explicit algorithmic procedure that retraces SGD steps using the sequence of intermediate models produced during training. This construction is presented as a direct heuristic extension of SGD mechanics to the non-convex, non-optimal setting; it does not rely on fitting parameters to a target quantity and then relabeling the fit as a prediction, nor on any self-citation chain that would render the central claim tautological. The experimental claims (accurate identification on MNIST/CIFAR-10 and accuracy gains after removal) are external empirical checks rather than internal reductions. No load-bearing step reduces by definition or by self-reference to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that retracing the SGD trajectory with intermediate models captures instance influence for non-convex cases.

axioms (1)
  • Domain assumption: Influential instances can be identified by retracing the SGD trajectory using intermediate models.
    This is the core premise enabling the method to work without convexity or optimality.

pith-pipeline@v0.9.0 · 5723 in / 1224 out tokens · 33164 ms · 2026-05-25T19:33:33.329771+00:00 · methodology

