Data Cleansing for Models Trained with SGD

Atsushi Nitanda; Satoshi Hara; Takanori Maehara

arxiv: 1906.08473 · v1 · pith:TIBGC4SKnew · submitted 2019-06-20 · 📊 stat.ML · cs.LG

Satoshi Hara, Atsushi Nitanda, Takanori Maehara

Pith reviewed 2026-05-25 19:33 UTC · model grok-4.3

keywords: data cleansing, influential instances, stochastic gradient descent, machine learning, model improvement, MNIST, CIFAR10

The pith

Retracing SGD steps identifies influential training instances for data cleansing without convexity or optimality assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an algorithm that suggests which training instances most affect a model's predictions when training uses stochastic gradient descent. It works by following the sequence of model updates produced during SGD and incorporating the intermediate models saved at each step to score influence. This removes the need for domain expertise, convex loss functions, or an optimal model, limitations of earlier approaches. Experiments show the method accurately recovers influential instances and that deleting them raises accuracy on MNIST and CIFAR10.

Core claim

The paper claims that influential instances can be inferred by retracing the SGD optimization trajectory while incorporating the intermediate models computed in each step, enabling effective data cleansing for models trained with non-convex losses and without reaching an optimum.

What carries the argument

The retracing procedure that follows each SGD update and uses the sequence of intermediate models to compute influence scores.
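
As an illustration of the retracing idea, here is a minimal sketch on a toy least-squares problem. It is not the authors' implementation. It assumes plain minibatch SGD with a fixed learning rate, a squared loss, and a counterfactual run that skips instance `j` while keeping the original batch-size normalisation. The function names `sgd_with_history` and `retrace_influence` and the probe vector `u` are hypothetical. The backward pass visits the saved intermediate models from the last step to the first and accumulates an estimate of `u · (theta_without_j - theta)`.

```python
import numpy as np

def sgd_with_history(X, y, batches, lr, skip=None):
    """Minibatch SGD on 0.5 * (x . theta - y)^2.

    Returns the final parameters and the list of intermediate parameters,
    one per step (the model *before* each update). If `skip` is given,
    that instance is dropped from every batch (counterfactual run).
    """
    theta = np.zeros(X.shape[1])
    history = []
    for idx in batches:
        history.append(theta.copy())
        keep = idx if skip is None else idx[idx != skip]
        resid = X[keep] @ theta - y[keep]
        # Normalise by the original batch size (an assumption of this sketch).
        theta = theta - lr * (X[keep].T @ resid) / len(idx)
    return theta, history

def retrace_influence(X, y, batches, history, lr, j, u):
    """Estimate u . (theta_without_j - theta) by walking the SGD steps backward."""
    v = u.astype(float)
    total = 0.0
    for idx, theta in zip(reversed(batches), reversed(history)):
        if j in idx:
            # Gradient of instance j at the intermediate model of this step.
            g_j = (X[j] @ theta - y[j]) * X[j]
            total += lr / len(idx) * (v @ g_j)
        # Hessian-vector product of the batch loss, leaving out instance j
        # (a choice made here so the toy example is exact).
        others = idx[idx != j]
        v = v - lr * (X[others].T @ (X[others] @ v)) / len(idx)
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(40, 3)), rng.normal(size=40)
    batches = [rng.choice(40, size=8, replace=False) for _ in range(50)]
    theta, history = sgd_with_history(X, y, batches, lr=0.05)
    theta_cf, _ = sgd_with_history(X, y, batches, lr=0.05, skip=5)
    u = rng.normal(size=3)
    print(retrace_influence(X, y, batches, history, 0.05, 5, u), u @ (theta_cf - theta))
```

Because the toy loss is quadratic, the retraced estimate agrees with actual retraining without `j` up to floating-point error. For a non-convex loss the same recursion would generally be only an approximation, and the Hessian-vector product would have to be computed at each saved intermediate model.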

If this is right

Models trained with SGD can be improved simply by inspecting and removing the instances the algorithm flags.
The method applies directly to non-convex models common in deep learning.
Data cleansing becomes feasible for users without specialized domain knowledge.
Experiments on image datasets confirm that the flagged instances are genuinely influential.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retracing idea could be tested on other first-order optimizers such as Adam.
The approach might surface systematic biases in a dataset by repeatedly identifying the same examples as influential.
For very large datasets the method could be paired with cheaper approximations to keep computation feasible.

Load-bearing premise

That retracing the SGD path with intermediate models correctly identifies which instances are influential, even for non-convex losses and without an optimal model.

What would settle it

Removing the instances flagged by the method produces no improvement or a decline in validation accuracy, unlike removal of randomly chosen instances of the same number.
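
The check described above can be sketched as a small harness. This is a hedged outline, not a protocol taken from the paper: `removal_check` and the callback `fit_and_score` are hypothetical names, the callback is assumed to retrain on the kept indices and return validation accuracy, and a higher score is assumed to mean a more harmful instance.

```python
import numpy as np

def removal_check(fit_and_score, n_train, scores, k, n_random=20, seed=0):
    """Compare removing the k highest-scored instances with removing k random ones.

    fit_and_score(keep_idx) -> validation accuracy after training on keep_idx
    (hypothetical callback supplied by the user).
    Returns (accuracy after flagged removal, mean and std after random removal).
    """
    rng = np.random.default_rng(seed)
    all_idx = np.arange(n_train)
    flagged = np.argsort(scores)[-k:]  # assumption: higher score = more harmful
    acc_flagged = fit_and_score(np.setdiff1d(all_idx, flagged))
    acc_random = []
    for _ in range(n_random):
        drop = rng.choice(n_train, size=k, replace=False)
        acc_random.append(fit_and_score(np.setdiff1d(all_idx, drop)))
    return acc_flagged, float(np.mean(acc_random)), float(np.std(acc_random))
```

If the accuracy after flagged removal is not clearly above the random-removal mean (relative to its spread), the claim would be in doubt under this reading.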

Figures

Figures reproduced from arXiv: 1906.08473 by Atsushi Nitanda, Satoshi Hara, Takanori Maehara.

**Figure 2.** Average misclassification rates on the test set after data cleansing. The errorbars are omitted for …

**Figure 3.** Examples of found influential instances and their labels in (a)(b) MNIST and (c)(d) CIFAR10.

**Figure 4.** Structures of convolutional neural networks (CNNs)

**Figure 5.** Structures of Autoencoders

**Figure 6.** MNIST: Average misclassification rates on the test set after data cleansing over 30 experiments.

**Figure 7.** Exhaustive results on MNIST: [Thick lines] Average misclassification rates on the test set after …

**Figure 8.** CIFAR10: Average misclassification rates on the test set after data cleansing over 30 experiments.

**Figure 9.** Exhaustive results on CIFAR10: [Thick lines] Average misclassification rates on the test set after …

**Figure 10.** Comparison of the misclassification rates before and after the data cleansing with the proposed …

**Figure 11.** Examples of found top-20 influential instances in MNIST

**Figure 12.** Examples of found top-20 influential instances in CIFAR10

Original abstract

Data cleansing is a typical approach used to improve the accuracy of machine learning models, which, however, requires extensive domain knowledge to identify the influential instances that affect the models. In this paper, we propose an algorithm that can suggest influential instances without using any domain knowledge. With the proposed method, users only need to inspect the instances suggested by the algorithm, implying that users do not need extensive knowledge for this procedure, which enables even non-experts to conduct data cleansing and improve the model. The existing methods require the loss function to be convex and an optimal model to be obtained, which is not always the case in modern machine learning. To overcome these limitations, we propose a novel approach specifically designed for the models trained with stochastic gradient descent (SGD). The proposed method infers the influential instances by retracing the steps of the SGD while incorporating intermediate models computed in each step. Through experiments, we demonstrate that the proposed method can accurately infer the influential instances. Moreover, we used MNIST and CIFAR10 to show that the models can be effectively improved by removing the influential instances suggested by the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a direct way to flag influential training points in SGD models by retracing the optimization path, and the MNIST/CIFAR10 results show removal can lift accuracy without needing convexity or an optimum.

The main takeaway is that this work supplies a practical algorithm for data cleansing that works on the non-convex, non-optimal models that SGD actually produces. It does this by walking backward through the recorded intermediate models and steps rather than relying on influence functions that assume convexity and convergence to a minimum. That is the concrete advance over earlier methods.

The experiments back the claim on standard image benchmarks: the flagged points, when removed, improve test accuracy on both MNIST and CIFAR10, and the method appears to surface the right instances without domain knowledge. The approach is also cheap to run once the training trajectory is saved. Those are the parts that hold up.

The evaluation section is thinner than ideal. The abstract states that the method infers influential instances accurately, yet the provided details do not spell out the quantitative criteria used to judge accuracy or the direct baselines against other influence estimators. Without those numbers it is hard to gauge how large the gain is or whether the retracing heuristic degrades on other architectures or losses. The method is presented as a heuristic rather than a guarantee, which is honest, but readers will still want to see failure cases or sensitivity to learning-rate schedules.

This paper is aimed at practitioners who train deep models and want a lightweight way to audit their training sets. Anyone already storing checkpoints or intermediate gradients will find the implementation straightforward. It is solid enough on its own terms to merit referee time; the core construction is clear and the empirical demonstration is on real data. I would send it out for review rather than desk-reject.

Referee Report

0 major / 3 minor

Summary. The paper proposes an algorithm to identify influential training instances for data cleansing in models trained via SGD. The method retraces the SGD trajectory by incorporating intermediate models at each step, avoiding requirements for convex losses or globally optimal models. Experiments claim to show accurate inference of influential instances, with removal of suggested instances improving accuracy on MNIST and CIFAR10.

Significance. If the empirical results hold under non-convex training, the heuristic offers a practical, domain-knowledge-free approach to data cleansing for modern deep learning pipelines where influence functions are inapplicable. It could enable non-experts to debug training data and improve models without convexity assumptions.

minor comments (3)

[Experiments] Experiments section: the abstract states that experiments demonstrate 'accurate inference' and model improvement, but quantitative metrics (e.g., precision of identified instances, comparison to baselines such as random removal or standard influence functions), dataset splits, and exact evaluation protocol are needed to substantiate the central empirical claim.
[Method] Method description: clarify the precise procedure for storing and reusing intermediate models during retracing, including any approximations used to control memory and compute cost when the number of SGD steps is large.
Notation: ensure consistent use of symbols for the sequence of intermediate parameters and the influence score definition across equations and pseudocode.
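
On the memory question raised in the second comment, one generic option is to store only every k-th intermediate model and recompute the rest by replaying the deterministic SGD steps. The sketch below shows that idea under stated assumptions. It is not claimed to be what the paper does: `run_with_sparse_checkpoints`, `replay_backward`, and `step_fn(theta, t)` (a deterministic update for step `t`) are hypothetical names.

```python
import numpy as np

def run_with_sparse_checkpoints(theta0, step_fn, n_steps, every):
    """Run n_steps updates, saving the parameters only at steps 0, every, 2*every, ..."""
    ckpt = {}
    theta = theta0
    for t in range(n_steps):
        if t % every == 0:
            ckpt[t] = theta.copy()
        theta = step_fn(theta, t)
    return theta, ckpt

def replay_backward(ckpt, step_fn, n_steps, every):
    """Yield (t, theta_t) for t = n_steps-1, ..., 0.

    Each segment is rebuilt forward from its checkpoint, so at most
    `every` intermediate models are held in memory besides the checkpoints.
    """
    for start in sorted(ckpt, reverse=True):
        end = min(start + every, n_steps)
        segment = [ckpt[start]]
        for t in range(start, end - 1):
            segment.append(step_fn(segment[-1], t))
        for t in range(end - 1, start - 1, -1):
            yield t, segment[t - start]
```

The trade-off is roughly one extra forward pass over the training steps in exchange for holding about `n_steps / every + every` models instead of `n_steps`. It relies on the updates being reproducible, for example by fixing the minibatch order.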

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. The referee correctly notes that our method avoids convexity assumptions by retracing the SGD trajectory with intermediate models, which is the core contribution for practical data cleansing in modern non-convex settings.

Circularity Check

0 steps flagged

No significant circularity

The paper's core contribution is an explicit algorithmic procedure that retraces SGD steps using the sequence of intermediate models produced during training. This construction is presented as a direct heuristic extension of SGD mechanics to the non-convex, non-optimal setting; it does not rely on fitting parameters to a target quantity and then relabeling the fit as a prediction, nor on any self-citation chain that would render the central claim tautological. The experimental claims (accurate identification on MNIST/CIFAR-10 and accuracy gains after removal) are external empirical checks rather than internal reductions. No load-bearing step reduces by definition or by self-reference to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that retracing the SGD trajectory with intermediate models captures instance influence for non-convex cases.

axioms (1)

Domain assumption: Influential instances can be identified by retracing the SGD trajectory using intermediate models.
This is the core premise enabling the method to work without convexity or optimality.

pith-pipeline@v0.9.0 · 5723 in / 1224 out tokens · 33164 ms · 2026-05-25T19:33:33.329771+00:00 · methodology

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] Charu C Aggarwal. Outlier Analysis Second Edition. Springer, 2016.

[2] RJ Beckman and HJ Trussell. The distribution of an arbitrary studentized residual and the effects of updating in multiple regression. Journal of the American Statistical Association, 69(345):199–201, 1974.

[3] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, volume 29, pages 93–104. ACM, 2000.

[4] R Dennis Cook and Sanford Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495–508, 1980.

[5] R Dennis Cook. Detection of influential observation in linear regression. Technometrics, 19(1):15–18, 1977.

[6] Dheeru Dua and Efi Karra Taniskidou. UCI machine learning repository, 2017.

[7] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[8] Rajiv Khanna, Been Kim, Joydeep Ghosh, and Oluwasanmi Koyejo. Interpreting black box predictions using Fisher kernels. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 3382–3390, 2019.

[9] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, pages 1885–1894, 2017.

[10] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[11] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[12] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE, 2008.

[13] Andrew Ng. Machine learning yearning, 2017.

[14] Daryl Pregibon. Logistic regression diagnostics. The Annals of Statistics, 9(4):705–724, 1981.

[15] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.

[16] Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

[17] Xuezhou Zhang, Xiaojin Zhu, and Stephen Wright. Training set debugging using trusted items. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 4482–4489, 2018.

[18] Chong Zhou and Randy C Paffenroth. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 665–674. ACM, 2017.