Preprocessing Methods and Pipelines of Data Mining: An Overview
Pith reviewed 2026-05-25 19:58 UTC · model grok-4.3
The pith
Data mining model quality depends more on input data quality than on model robustness techniques.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that data preprocessing, organized into the categories of cleaning, transformation, and reduction, forms an indispensable stage in the data mining pipeline whose proper execution determines the quality of the knowledge ultimately obtained.
What carries the argument
The three-way categorization of preprocessing into data cleaning, data transformation, and data reduction, which organizes the methods shown to influence downstream model behavior.
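To make the three categories concrete, here is a minimal sketch with one representative step per category. The specific methods (mean imputation, min-max scaling, a zero-variance filter), the function names, and the toy rows are assumptions chosen for illustration; the source names only the categories, not these implementations.

```python
from statistics import mean, pvariance

def clean(rows):
    """Cleaning: replace missing values (None) with the column mean."""
    cols = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in cols]
    return [[m if v is None else v for v, m in zip(row, col_means)] for row in rows]

def transform(rows):
    """Transformation: min-max scale every column to [0, 1]."""
    cols = list(zip(*rows))
    bounds = [(min(c), max(c)) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]

def reduce_features(rows, min_variance=1e-12):
    """Reduction: drop columns whose variance is (near) zero."""
    cols = list(zip(*rows))
    keep = [i for i, c in enumerate(cols) if pvariance(c) > min_variance]
    return [[row[i] for i in keep] for row in rows]

# Hypothetical toy table: one missing value, one constant column.
raw = [
    [1.0, 200.0, 7.0],
    [2.0, None, 7.0],
    [3.0, 400.0, 7.0],
]
prepared = reduce_features(transform(clean(raw)))
print(prepared)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

The order matters in this sketch: imputation runs first so that scaling sees no missing values, and the variance filter runs last so that it removes the constant third column after scaling.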
Load-bearing premise
That the standard split into cleaning, transformation, and reduction plus the methods listed in the overview captures the preprocessing steps that matter most for model quality.
What would settle it
A controlled experiment in which a data mining task reaches equal or higher accuracy after model tuning alone, with no preprocessing applied, or after using a preprocessing step that fits none of the three categories.
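The shape of such a comparison can be sketched as follows, assuming a 1-nearest-neighbor classifier and min-max scaling as the only preprocessing step. The data are hypothetical and deliberately built so that the second feature's scale dominates Euclidean distance; the result illustrates the harness, not any finding of the paper. The model-tuning arm of the experiment is omitted.

```python
import math

def min_max_fit(rows):
    """Return per-column (min, max) computed on the training rows only."""
    return [(min(c), max(c)) for c in zip(*rows)]

def min_max_apply(rows, bounds):
    """Rescale each column to [0, 1] using the fitted bounds."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]

def predict_1nn(train_x, train_y, point):
    """Label of the closest training row under Euclidean distance."""
    dists = [math.dist(row, point) for row in train_x]
    return train_y[dists.index(min(dists))]

def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test rows whose 1-NN prediction matches the label."""
    hits = sum(predict_1nn(train_x, train_y, p) == y for p, y in zip(test_x, test_y))
    return hits / len(test_y)

# Hypothetical toy data: the label depends only on feature 0,
# but feature 1 lives on a scale about 1000 times larger.
train_x = [[0.0, 0.0], [0.1, 0.0], [0.9, 1000.0], [1.0, 1000.0]]
train_y = [0, 0, 1, 1]
test_x = [[0.2, 600.0], [0.8, 400.0], [0.05, 100.0], [0.95, 900.0]]
test_y = [0, 1, 0, 1]

# Arm 1: no preprocessing.
acc_raw = accuracy(train_x, train_y, test_x, test_y)

# Arm 2: same model, inputs scaled with bounds fitted on the training set.
bounds = min_max_fit(train_x)
acc_scaled = accuracy(
    min_max_apply(train_x, bounds), train_y, min_max_apply(test_x, bounds), test_y
)

print(acc_raw, acc_scaled)  # 0.5 1.0
```

On this contrived data the unscaled arm scores 0.5 and the scaled arm 1.0. The claim would be challenged by a realistic task where the first number matches or exceeds the second.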
Original abstract
Data mining is about obtaining new knowledge from existing datasets. However, the data in the existing datasets can be scattered, noisy, and even incomplete. Although lots of effort is spent on developing or fine-tuning data mining models to make them more robust to the noise of the input data, their qualities still strongly depend on the quality of it. The article starts with an overview of the data mining pipeline, where the procedures in a data mining task are briefly introduced. Then an overview of the data preprocessing techniques which are categorized as the data cleaning, data transformation and data preprocessing is given. Detailed preprocessing methods, as well as their influenced on the data mining models, are covered in this article.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey that first sketches the standard data-mining pipeline and then organizes preprocessing techniques under the headings of data cleaning, data transformation, and data reduction. It enumerates common methods in each category and states their typical effects on downstream model quality, reiterating the observation that model performance remains strongly dependent on input-data quality.
Significance. If the descriptions are accurate, the paper offers a compact reference that consolidates well-known preprocessing steps and their qualitative influences. Because the work is purely descriptive and advances no new theorem, experiment, or formal argument, its significance is limited to pedagogical or organizational utility rather than any advance in the field.
minor comments (3)
- Abstract, paragraph 3: the third preprocessing category is labeled 'data preprocessing,' which is circular with the section title; the surrounding text and conventional taxonomy indicate the intended label is 'data reduction.'
- Abstract and §1: the claim that 'their qualities still strongly depend on the quality of it' is presented without any quantitative support or citation; while plausible, the statement is left as an unexamined premise rather than a substantiated observation.
- The manuscript does not indicate the criteria used to select the listed methods or the time window of the cited literature, making it difficult to assess completeness or currency of the survey.
Simulated Author's Rebuttal
We thank the referee for reviewing our survey manuscript and for the recommendation of minor revision. The referee's summary accurately captures the scope and content of the paper as a descriptive overview of the data-mining pipeline and preprocessing categories. No specific major comments were enumerated in the report, so we have no individual points requiring rebuttal or revision at this stage.
Circularity Check
No significant circularity; purely descriptive survey with no derivations or predictions
full rationale
This is a survey paper that organizes existing preprocessing techniques under the standard categories of cleaning, transformation, and reduction. The abstract and provided text contain no equations, formal derivations, predictions, or load-bearing arguments. The observation that downstream model quality depends on input data quality is presented as background knowledge rather than derived from any internal construction. No self-citations, ansatzes, or uniqueness claims are invoked in a way that reduces the content to its own inputs. The structure is consistent with a typical literature overview and introduces no internal inconsistencies or fitted quantities presented as predictions.