Preprocessing Methods and Pipelines of Data Mining: An Overview

Canchen Li

arXiv: 1906.08510 · v1 · pith:LXMXL7MEnew · submitted 2019-06-20 · 💻 cs.LG · cs.DB · stat.ML

Pith reviewed 2026-05-25 19:58 UTC · model grok-4.3

classification 💻 cs.LG · cs.DB · stat.ML

keywords data mining · preprocessing · data cleaning · data transformation · data reduction · machine learning pipeline

The pith

Data mining model quality depends more on input data quality than on model robustness techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Data mining extracts new knowledge from datasets that are often scattered, noisy, or incomplete. Although effort goes into making models robust to input flaws, the paper states that results still hinge primarily on data quality. It first sketches the full data mining pipeline and then surveys preprocessing methods grouped into cleaning, transformation, and reduction. The survey covers concrete techniques in each group and notes how they shape the performance of later mining steps. A reader cares because this directs attention to data preparation as the lever that most directly improves knowledge extraction.

Core claim

The paper claims that data preprocessing, organized into the categories of cleaning, transformation, and reduction, forms an indispensable stage in the data mining pipeline whose proper execution determines the quality of the knowledge ultimately obtained.

What carries the argument

The three-way categorization of preprocessing into data cleaning, data transformation, and data reduction, which organizes the methods shown to influence downstream model behavior.
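
As a rough illustration of the three categories, the sketch below chains one step from each on a toy table: mean imputation (cleaning), min-max scaling (transformation), and dropping near-constant columns (reduction). The function names, the toy data, and the variance threshold are assumptions made for this example; they are not taken from the paper.

```python
from statistics import mean, pvariance

def impute_mean(rows):
    """Cleaning: replace None in each column with that column's mean."""
    cols = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in cols]
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]

def min_max_scale(rows):
    """Transformation: rescale each column to [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]

def drop_low_variance(rows, threshold=1e-3):
    """Reduction: keep only columns whose population variance exceeds threshold."""
    cols = list(zip(*rows))
    keep = [i for i, c in enumerate(cols) if pvariance(c) > threshold]
    return [[row[i] for i in keep] for row in rows], keep

def preprocess(rows):
    """Apply cleaning, then transformation, then reduction."""
    cleaned = impute_mean(rows)
    scaled = min_max_scale(cleaned)
    return drop_low_variance(scaled)

# Hypothetical toy table: one missing value, one constant column.
raw = [
    [1.0, 200.0, 7.0],
    [2.0, None, 7.0],
    [3.0, 400.0, 7.0],
]
features, kept_columns = preprocess(raw)
print(kept_columns)  # indices of the columns that survive reduction
print(features)
```

The order of the steps matters in this sketch: scaling needs complete columns, so imputation runs first, and the variance filter runs last so that the threshold is applied on a common scale.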

Load-bearing premise

That the standard split into cleaning, transformation, and reduction plus the methods listed in the overview captures the preprocessing steps that matter most for model quality.

What would settle it

A controlled experiment in which a data mining task reaches equal or higher accuracy after model tuning alone, with no preprocessing applied, or after using a preprocessing step that fits none of the three categories.
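
The shape of such a comparison can be sketched in a few lines. The example below scores the same 1-nearest-neighbour classifier with and without z-score standardization, using leave-one-out accuracy. The six data points are hypothetical and were deliberately built so that a large-scale noise feature swamps the informative one, so the outcome shows only how the harness works and is not evidence for or against the paper's claim.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column using the population standard deviation."""
    cols = list(zip(*rows))
    mu = [mean(c) for c in cols]
    sd = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mu, sd)] for row in rows]

def loo_1nn_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    hits = 0
    for i, query in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(query, rows[j])),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(rows)

# Hypothetical data: feature 0 separates the classes, feature 1 is
# large-scale noise that dominates Euclidean distance when left unscaled.
X = [
    [0.0, 100.0], [0.1, 500.0], [0.2, 900.0],   # class 0
    [1.0, 120.0], [1.1, 520.0], [0.9, 880.0],   # class 1
]
y = [0, 0, 0, 1, 1, 1]

raw_accuracy = loo_1nn_accuracy(X, y)
# Simplification: the scaler is fitted on all rows, which leaks statistics
# across folds; a stricter protocol would refit it inside each fold.
scaled_accuracy = loo_1nn_accuracy(standardize(X), y)
print(raw_accuracy, scaled_accuracy)
```

A real test of the claim would replace the toy points with benchmark datasets, tune the model in both arms, and report whether the unpreprocessed arm ever matches or beats the preprocessed one.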

Figures

Figures reproduced from arXiv: 1906.08510 by Canchen Li.

**Figure 2.** The effect of Box-Cox transformation in linear regression. The feature […]
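
For reference, the one-parameter Box-Cox transform maps a positive value x to (x^λ − 1)/λ when λ ≠ 0 and to ln x when λ = 0. The sketch below is a minimal plain-Python version; the function name and the sample values are assumptions for illustration and do not reproduce the figure's data.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform of a strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

# Hypothetical right-skewed feature spanning three orders of magnitude.
skewed = [1.0, 10.0, 100.0, 1000.0]
log_like = [box_cox(v, 0) for v in skewed]   # lam = 0 is the natural log
shifted = [box_cox(v, 1) for v in skewed]    # lam = 1 only subtracts 1
print(log_like)
print(shifted)
```

In practice λ is usually chosen by maximum likelihood rather than fixed by hand; this sketch leaves that step out.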

**Figure 3.** A comparison of reduction result with PCA and LDA.
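
The difference between the two methods can be seen on a small two-dimensional example: PCA picks the direction of largest overall variance and ignores labels, while two-class LDA (Fisher's discriminant) picks the direction that best separates the class means relative to within-class scatter. The sketch below uses closed-form 2×2 algebra in plain Python; the points and function names are hypothetical and are not the data behind the figure.

```python
import math

def mean2(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter2(points, center):
    """Return (sxx, sxy, syy): sums of squared and cross deviations."""
    sxx = sum((p[0] - center[0]) ** 2 for p in points)
    sxy = sum((p[0] - center[0]) * (p[1] - center[1]) for p in points)
    syy = sum((p[1] - center[1]) ** 2 for p in points)
    return sxx, sxy, syy

def pca_direction(points):
    """Unit vector of maximum variance; class labels are not used."""
    sxx, sxy, syy = scatter2(points, mean2(points))
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # principal-axis angle
    return (math.cos(theta), math.sin(theta))

def lda_direction(class0, class1):
    """Fisher direction, proportional to inverse(Sw) times (m1 - m0)."""
    m0, m1 = mean2(class0), mean2(class1)
    a0, b0, c0 = scatter2(class0, m0)
    a1, b1, c1 = scatter2(class1, m1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1   # within-class scatter [[a, b], [b, c]]
    det = a * c - b * b
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    wx = (c * dx - b * dy) / det          # 2x2 inverse applied to (dx, dy)
    wy = (-b * dx + a * dy) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)

# Hypothetical data: wide spread along x, but the classes differ only in y.
class0 = [(-3.0, 0.0), (-1.0, 0.1), (1.0, -0.1), (3.0, 0.0)]
class1 = [(-3.0, 1.0), (-1.0, 1.1), (1.0, 0.9), (3.0, 1.0)]

pca = pca_direction(class0 + class1)
lda = lda_direction(class0, class1)
print(pca)  # close to the x-axis
print(lda)  # close to the y-axis
```

Projecting onto the PCA direction here would mix the two classes, whereas the LDA direction keeps them apart, which is the usual argument for preferring LDA when labels are available and class separation is the goal.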

Original abstract

Data mining is about obtaining new knowledge from existing datasets. However, the data in the existing datasets can be scattered, noisy, and even incomplete. Although lots of effort is spent on developing or fine-tuning data mining models to make them more robust to the noise of the input data, their qualities still strongly depend on the quality of it. The article starts with an overview of the data mining pipeline, where the procedures in a data mining task are briefly introduced. Then an overview of the data preprocessing techniques which are categorized as the data cleaning, data transformation and data preprocessing is given. Detailed preprocessing methods, as well as their influenced on the data mining models, are covered in this article.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a plain recap of standard preprocessing steps with no new methods, results, or analysis.

The paper gives an overview of the data mining pipeline and then groups preprocessing into the usual three buckets: cleaning, transformation, and reduction. It notes that model quality depends on input data quality and lists familiar techniques such as missing-value handling, normalization, and dimensionality reduction along with their typical effects on models. That is the entire contribution. Nothing in the text is original; every method and claim is drawn from the literature it cites, and no experiments, comparisons, or new derivations appear. The organization follows the textbook split that has been standard for years, so it does not offer a different lens or highlight gaps that matter now. The discussion of how each step influences models stays at the level of general statements without new evidence or quantitative support. For a reader who has never seen these steps before, the paper could function as a short checklist or teaching note. For anyone already working in the area it adds nothing. The central dependence claim is correct but was already obvious and is not tested here. I would not bring this to a reading group. It does not merit peer review as a research paper because it advances no claim that needs referee scrutiny. If a journal wants basic tutorial material it could be considered on those terms, but that is not the usual peer-review track.

Referee Report

0 major / 3 minor

Summary. The manuscript is a survey that first sketches the standard data-mining pipeline and then organizes preprocessing techniques under the headings of data cleaning, data transformation, and data reduction. It enumerates common methods in each category and states their typical effects on downstream model quality, reiterating the observation that model performance remains strongly dependent on input-data quality.

Significance. If the descriptions are accurate, the paper offers a compact reference that consolidates well-known preprocessing steps and their qualitative influences. Because the work is purely descriptive and advances no new theorem, experiment, or formal argument, its significance is limited to pedagogical or organizational utility rather than any advance in the field.

minor comments (3)

Abstract, paragraph 3: the third preprocessing category is labeled 'data preprocessing,' which is circular with the section title; the surrounding text and conventional taxonomy indicate the intended label is 'data reduction.'
Abstract and §1: the claim that 'their qualities still strongly depend on the quality of it' is presented without any quantitative support or citation; while plausible, the statement is left as an unexamined premise rather than a substantiated observation.
The manuscript does not indicate the criteria used to select the listed methods or the time window of the cited literature, making it difficult to assess completeness or currency of the survey.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for reviewing our survey manuscript and for the recommendation of minor revision. The referee's summary accurately captures the scope and content of the paper as a descriptive overview of the data-mining pipeline and preprocessing categories. No specific major comments were enumerated in the report, so we have no individual points requiring rebuttal or revision at this stage.

Circularity Check

0 steps flagged

No significant circularity; purely descriptive survey with no derivations or predictions

This is a survey paper that organizes existing preprocessing techniques under the standard categories of cleaning, transformation, and reduction. The abstract and provided text contain no equations, formal derivations, predictions, or load-bearing arguments. The observation that downstream model quality depends on input data quality is presented as background knowledge rather than derived from any internal construction. No self-citations, ansatzes, or uniqueness claims are invoked in a way that reduces the content to its own inputs. The structure is consistent with a typical literature overview and introduces no internal inconsistencies or fitted quantities presented as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper. No new mathematical claims, fitted parameters, or postulated entities are introduced.

pith-pipeline@v0.9.0 · 5635 in / 925 out tokens · 16891 ms · 2026-05-25T19:58:37.198994+00:00 · methodology

