arXiv: 1906.08510 · v1 · pith:LXMXL7MEnew · submitted 2019-06-20 · 💻 cs.LG · cs.DB · stat.ML

Preprocessing Methods and Pipelines of Data Mining: An Overview

Pith reviewed 2026-05-25 19:58 UTC · model grok-4.3

classification 💻 cs.LG · cs.DB · stat.ML
keywords data mining · preprocessing · data cleaning · data transformation · data reduction · machine learning pipeline

The pith

Data mining model quality depends more on input data quality than on model robustness techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Data mining extracts new knowledge from datasets that are often scattered, noisy, or incomplete. Although effort goes into making models robust to input flaws, the paper states that results still hinge primarily on data quality. It first sketches the full data mining pipeline and then surveys preprocessing methods grouped into cleaning, transformation, and reduction. The survey covers concrete techniques in each group and notes how they shape the performance of later mining steps. A reader cares because this directs attention to data preparation as the lever that most directly improves knowledge extraction.

Core claim

The paper claims that data preprocessing, organized into the categories of cleaning, transformation, and reduction, forms an indispensable stage in the data mining pipeline whose proper execution determines the quality of the knowledge ultimately obtained.

What carries the argument

The three-way categorization of preprocessing into data cleaning, data transformation, and data reduction, which organizes the methods shown to influence downstream model behavior.
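
To make the three categories concrete, here is a minimal sketch in plain Python. The specific methods (mean imputation for cleaning, z-score standardization for transformation, dropping constant features for reduction) are common textbook choices picked for illustration; this page does not confirm that they are the ones the paper emphasizes, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def clean(rows):
    """Cleaning: replace missing values (None) with the column mean."""
    cols = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in cols]
    return [[m if v is None else v for v, m in zip(row, col_means)]
            for row in rows]

def transform(rows):
    """Transformation: z-score each column; constant columns become 0.0."""
    cols = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in cols]
    return [[(v - mu) / sd if sd > 0 else 0.0 for v, (mu, sd) in zip(row, stats)]
            for row in rows]

def reduce_features(rows, min_variance=0.0):
    """Reduction: keep only columns whose variance exceeds min_variance."""
    cols = list(zip(*rows))
    keep = [i for i, col in enumerate(cols) if pstdev(col) ** 2 > min_variance]
    return [[row[i] for i in keep] for row in rows], keep

def preprocess(rows):
    """Run the three stages in order: cleaning, transformation, reduction."""
    return reduce_features(transform(clean(rows)))

# Toy data (assumed): one missing value and one constant, uninformative column.
rows = [[1.0, 10.0, 5.0],
        [2.0, None, 5.0],
        [3.0, 30.0, 5.0]]
result, kept = preprocess(rows)
print(kept)    # indices of the columns that survive reduction
print(result)  # standardized values for the surviving columns
```

Each stage is a separate function so that one can be swapped out, for example replacing mean imputation with a different strategy, without touching the others.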

Load-bearing premise

That the standard split into cleaning, transformation, and reduction plus the methods listed in the overview captures the preprocessing steps that matter most for model quality.

What would settle it

A controlled experiment in which a data mining task reaches equal or higher accuracy after model tuning alone, with no preprocessing applied, or after using a preprocessing step that fits none of the three categories.
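
The shape of such an experiment can be sketched as a small harness that scores the same model on the same split with and without a preprocessing step. The sketch below assumes a 1-nearest-neighbor classifier and min-max scaling, and uses a hand-built toy dataset chosen to be scale-sensitive. It illustrates the comparison only and is not evidence for or against the paper's claim; all names and data are hypothetical.

```python
def min_max_fit(rows):
    """Record (min, max) per column from the training data only."""
    return [(min(col), max(col)) for col in zip(*rows)]

def min_max_apply(rows, bounds):
    """Scale each column to the training range; constant columns map to 0.0."""
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, (lo, hi) in zip(row, bounds)]
            for row in rows]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_1nn(train_x, train_y, point):
    """Return the label of the closest training point."""
    nearest = min(range(len(train_x)),
                  key=lambda i: squared_distance(train_x[i], point))
    return train_y[nearest]

def accuracy(train_x, train_y, test_x, test_y):
    hits = sum(predict_1nn(train_x, train_y, p) == y
               for p, y in zip(test_x, test_y))
    return hits / len(test_y)

# Toy data (assumed): feature 0 separates the classes,
# feature 1 is large-scale noise.
train_x = [[0.0, 0.0], [0.1, 1000.0], [1.0, 300.0], [0.9, 700.0]]
train_y = [0, 0, 1, 1]
test_x = [[0.05, 350.0], [0.95, 50.0], [0.0, 10.0]]
test_y = [0, 1, 0]

# Condition A: no preprocessing.
raw_acc = accuracy(train_x, train_y, test_x, test_y)

# Condition B: min-max scaling fitted on the training split only.
bounds = min_max_fit(train_x)
scaled_acc = accuracy(min_max_apply(train_x, bounds), train_y,
                      min_max_apply(test_x, bounds), test_y)

print(f"raw accuracy:    {raw_acc:.2f}")
print(f"scaled accuracy: {scaled_acc:.2f}")
```

A real test of the claim would need many datasets, repeated splits, and a tuned model in the no-preprocessing condition. Fitting the scaler on the training split alone keeps test data from leaking into the preprocessing step.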

Figures

Figures reproduced from arXiv: 1906.08510 by Canchen Li.

Figure 1: An illustration of data mining pipeline.
Figure 2: The effect of Box-Cox transformation in linear regression. The feature …
Figure 3: A comparison of reduction result with PCA and LDA.
Original abstract

Data mining is about obtaining new knowledge from existing datasets. However, the data in the existing datasets can be scattered, noisy, and even incomplete. Although lots of effort is spent on developing or fine-tuning data mining models to make them more robust to the noise of the input data, their qualities still strongly depend on the quality of it. The article starts with an overview of the data mining pipeline, where the procedures in a data mining task are briefly introduced. Then an overview of the data preprocessing techniques which are categorized as the data cleaning, data transformation and data preprocessing is given. Detailed preprocessing methods, as well as their influenced on the data mining models, are covered in this article.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript is a survey that first sketches the standard data-mining pipeline and then organizes preprocessing techniques under the headings of data cleaning, data transformation, and data reduction. It enumerates common methods in each category and states their typical effects on downstream model quality, reiterating the observation that model performance remains strongly dependent on input-data quality.

Significance. If the descriptions are accurate, the paper offers a compact reference that consolidates well-known preprocessing steps and their qualitative influences. Because the work is purely descriptive and advances no new theorem, experiment, or formal argument, its significance is limited to pedagogical or organizational utility rather than any advance in the field.

minor comments (3)
  1. Abstract, paragraph 3: the third preprocessing category is labeled 'data preprocessing,' which is circular with the section title; the surrounding text and conventional taxonomy indicate the intended label is 'data reduction.'
  2. Abstract and §1: the claim that 'their qualities still strongly depend on the quality of it' is presented without any quantitative support or citation; while plausible, the statement is left as an unexamined premise rather than a substantiated observation.
  3. The manuscript does not indicate the criteria used to select the listed methods or the time window of the cited literature, making it difficult to assess completeness or currency of the survey.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for reviewing our survey manuscript and for the recommendation of minor revision. The referee's summary accurately captures the scope and content of the paper as a descriptive overview of the data-mining pipeline and preprocessing categories. No specific major comments were enumerated in the report, so we have no individual points requiring rebuttal or revision at this stage.

Circularity Check

0 steps flagged

No significant circularity; purely descriptive survey with no derivations or predictions

full rationale

This is a survey paper that organizes existing preprocessing techniques under the standard categories of cleaning, transformation, and reduction. The abstract and provided text contain no equations, formal derivations, predictions, or load-bearing arguments. The observation that downstream model quality depends on input data quality is presented as background knowledge rather than derived from any internal construction. No self-citations, ansatzes, or uniqueness claims are invoked in a way that reduces the content to its own inputs. The structure is consistent with a typical literature overview and introduces no internal inconsistencies or fitted quantities presented as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper. No new mathematical claims, fitted parameters, or postulated entities are introduced.

pith-pipeline@v0.9.0 · 5635 in / 925 out tokens · 16891 ms · 2026-05-25T19:58:37.198994+00:00 · methodology
