
arXiv: 1907.04318 · v1 · pith:GGS4WUIJnew · submitted 2019-07-09 · 🧬 q-bio.QM · cs.DB · cs.LG · stat.ML

Computer-Aided Data Mining: Automating a Novel Knowledge Discovery and Data Mining Process Model for Metabolomics

Pith reviewed 2026-05-25 00:17 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.DB · cs.LG · stat.ML
keywords metabolomics · knowledge discovery · data mining process model · automation software · traceable analysis · reproducible workflows · version control

The pith

MeKDDaM-SAGA software automates a custom process model to make metabolomics data analysis justifiable, traceable and reproducible.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MeKDDaM-SAGA, Java software that implements a novel knowledge discovery and data mining process model tailored to metabolomics objectives and data characteristics. The software supports process execution either through external data mining tools or through its own embedded functions for preprocessing, exploration, acclimatization, modeling, evaluation and visualization. It was applied to multiple metabolomics cases, uses an XML database and GUI, and includes an internal version control system to handle flow, feedback and iterations.
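The split between external and internal execution can be pictured as two implementations of one interface. The following is a minimal sketch under stated assumptions, not the authors' design: the names ActivityExecutor, InternalExecutor, ExternalExecutor and the mean-centring function are hypothetical and do not come from MeKDDaM-SAGA.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Demo entry point; every name in this file is hypothetical, not from MeKDDaM-SAGA. */
final class ExecutionModeDemo {
    /** Stand-in for an embedded preprocessing function: subtract the mean from each value. */
    static List<Double> meanCentre(List<Double> values) {
        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        return values.stream().map(v -> v - mean).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Double> intensities = List.of(1.0, 2.0, 3.0);
        ActivityExecutor internal =
                new InternalExecutor("mean-centring", ExecutionModeDemo::meanCentre);
        // For the external case the result is assumed to have been produced by an outside tool.
        ActivityExecutor external =
                new ExternalExecutor("some-external-tool", List.of(-1.0, 0.0, 1.0));
        for (ActivityExecutor executor : List.of(internal, external)) {
            System.out.println(executor.provenance() + " -> " + executor.run(intensities));
        }
    }
}

/** One way a phase activity could be carried out. */
interface ActivityExecutor {
    List<Double> run(List<Double> input);

    /** Short label recording how the result was obtained, for traceability. */
    String provenance();
}

/** Internal execution: the activity is performed by an embedded function. */
final class InternalExecutor implements ActivityExecutor {
    private final String name;
    private final UnaryOperator<List<Double>> function;

    InternalExecutor(String name, UnaryOperator<List<Double>> function) {
        this.name = name;
        this.function = function;
    }

    @Override public List<Double> run(List<Double> input) { return function.apply(input); }
    @Override public String provenance() { return "internal:" + name; }
}

/** External execution: an outside tool did the work and its result is only recorded here. */
final class ExternalExecutor implements ActivityExecutor {
    private final String toolName;
    private final List<Double> importedResult;

    ExternalExecutor(String toolName, List<Double> importedResult) {
        this.toolName = toolName;
        this.importedResult = List.copyOf(importedResult);
    }

    @Override public List<Double> run(List<Double> input) { return importedResult; }
    @Override public String provenance() { return "external:" + toolName; }
}
```

In this sketch both modes leave the same kind of provenance label behind, which is one plausible way to keep a process traceable whichever route the analyst takes.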

Core claim

MeKDDaM-SAGA realises the layout, structure and flow of the proposed process model through 241 design classes implementing 27 use-cases, enabling external or internal execution of the model phases while guaranteeing portability, user-friendliness and management of iterations via its embedded version control.

What carries the argument

MeKDDaM-SAGA, the object-oriented Java software consisting of 241 classes that implements the novel KDD process model for metabolomics either externally or via its built-in activities.

If this is right

  • Analyses guided by the model can use either external data mining tools or the software's internal preprocessing, modeling and visualization facilities.
  • Process flow, feedback loops and iterations are managed automatically by the embedded version control system, allowing undo and redo of phases and tasks.
  • Portability is ensured by the XML database while the GUI supports direct user interaction with the process.
  • The software satisfies the design and execution requirements of the proposed metabolomics process model in the applications tested.
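The undo and redo behaviour described above can be illustrated with the classic two-stack history. This is a generic sketch and not the embedded version control system of MeKDDaM-SAGA, whose internals are not described on this page; PhaseSnapshot, PhaseHistory and the snapshot fields are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Demo entry point; all names here are hypothetical, not taken from MeKDDaM-SAGA. */
final class PhaseHistoryDemo {
    public static void main(String[] args) {
        PhaseHistory history = new PhaseHistory();
        history.commit(new PhaseSnapshot("Modelling", 1, "first technique"));
        history.commit(new PhaseSnapshot("Evaluation", 1, "results judged unsatisfactory"));

        history.undo(); // roll back to the state after Modelling
        System.out.println("after undo: " + history.current().phase);

        // Feedback: re-run Modelling with another technique; the undone branch is discarded.
        history.commit(new PhaseSnapshot("Modelling", 2, "second technique"));
        System.out.println("redo still possible: " + history.redo());
    }
}

/** Immutable record of one executed phase. */
final class PhaseSnapshot {
    final String phase;
    final int iteration;
    final String note;

    PhaseSnapshot(String phase, int iteration, String note) {
        this.phase = phase;
        this.iteration = iteration;
        this.note = note;
    }
}

/** Two-stack history: 'done' holds executed snapshots, 'undone' holds rolled-back ones. */
final class PhaseHistory {
    private final Deque<PhaseSnapshot> done = new ArrayDeque<>();
    private final Deque<PhaseSnapshot> undone = new ArrayDeque<>();

    /** Record a newly executed phase; anything previously undone can no longer be redone. */
    void commit(PhaseSnapshot snapshot) {
        done.push(snapshot);
        undone.clear();
    }

    /** Roll back the most recent phase; returns false if there is nothing to undo. */
    boolean undo() {
        if (done.isEmpty()) return false;
        undone.push(done.pop());
        return true;
    }

    /** Re-apply the most recently undone phase; returns false if there is nothing to redo. */
    boolean redo() {
        if (undone.isEmpty()) return false;
        done.push(undone.pop());
        return true;
    }

    /** Most recent executed snapshot, or null when the history is empty. */
    PhaseSnapshot current() {
        return done.peek();
    }
}
```

A real version control layer would presumably keep the discarded branch rather than clearing it, since traceability depends on being able to see abandoned attempts; the sketch drops it only to stay short.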

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use of such automation could reduce differences in analysis pipelines between different metabolomics laboratories.
  • The same software architecture might be adapted to create similar guided process models for other data-intensive fields such as proteomics or environmental chemistry.
  • Without published head-to-head comparisons against standard KDD frameworks, it remains open whether the claimed traceability gains exceed those from existing tools.

Load-bearing premise

The authors' custom process model is both novel and required for metabolomics, and the software implements it correctly without adding new errors or limitations.

What would settle it

An independent test that applies the same metabolomics datasets with and without MeKDDaM-SAGA and finds no measurable gain in traceability or reproducibility of results would falsify the effectiveness claim.
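One narrow ingredient of such a test is checking whether two runs of the same analysis produce byte-identical results. The sketch below fingerprints a list of result lines with SHA-256 from the standard Java library; it is an illustrative assumption about how a check could be set up, not a procedure from the paper, and ReproducibilityCheck and the sample result lines are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/** Hypothetical helper: compares two analysis runs by hashing their textual results. */
final class ReproducibilityCheck {
    /** SHA-256 over the result lines, each terminated by a newline, as lowercase hex. */
    static String fingerprint(List<String> resultLines) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (String line : resultLines) {
                digest.update(line.getBytes(StandardCharsets.UTF_8));
                digest.update((byte) '\n');
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on the Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /** True when both runs produced exactly the same lines in the same order. */
    static boolean reproduced(List<String> firstRun, List<String> secondRun) {
        return fingerprint(firstRun).equals(fingerprint(secondRun));
    }

    public static void main(String[] args) {
        // Made-up result lines used only to exercise the check.
        List<String> runA = List.of("sample-1,class-A", "sample-2,class-B");
        List<String> runB = List.of("sample-1,class-A", "sample-2,class-B");
        List<String> runC = List.of("sample-1,class-A", "sample-2,class-A");
        System.out.println("A vs B reproduced: " + reproduced(runA, runB));
        System.out.println("A vs C reproduced: " + reproduced(runA, runC));
    }
}
```

Exact hashing only detects identical output. A comparison of pipelines with and without the software would also need a tolerance for numerical noise and some agreed measure of traceability, neither of which this sketch attempts.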

Figures

Figures reproduced from arXiv: 1907.04318 by Ahmed BaniMustafa, Nigel Hardy.

Figure 1: MeKDDaM Process Model: the graphical representation of the process model illustrates the process model structure and the flow of its phases. It also defines the inputs, deliveries and the participants in each phase. The phases are briefly described as follows: 1. Objectives Definition provides a mechanism for defining the process objectives by matching the data mining approaches, goals, and tasks to aims o…
Figure 2: Process Execution Scenarios (repeated for reference only): The tree-like graph shows seven process execution scenarios, which illustrate examples of MeKDDaM process model execution that demonstrate the process feedback, rollback, phase iteration, and process iteration mechanisms. 3 Methodology of Software Development The development of MeKDDaM-SAGA has been conducted based on object-oriented software engi…
Figure 3: Process Use-Case Model: A UML Use-Case Model showing interaction between human users and the process. 3.2 MeKDDaM-SAGA Design and Construction MeKDDaM-SAGA implementation was constructed using 241 Java classes which were organised into two major packages. The first package was designed to implement and realise the process data model as described in
Figure 4: Software Packages: A UML diagram illustrating the packages in the software environment used for implementing the process.
Figure 5: Inputs Class Diagram: A UML class diagram for the process Inputs and their associated classes.
Figure 6: Project Class Diagram: A UML diagram representing the project and its relevant classes. Process Phases The Phase class is used for realising MeKDDaM phases and recording information regarding their execution, iteration, and feedbacks. An object of this class is created for each of the eleven process phases in addition to other copies which are instantiated, copied or cloned to resemble the phases’ feedbacks …
Figure 7: Process Class Diagram: A UML diagram representation of the process and its relevant classes. The Phase class also implements a number of functions concerning the execution, iteration, and rollback of the process phases, in addition to their persistence, saving, and loading.
Figure 8: Phase Class Diagram: A UML representation of process phases and their associated classes.
Figure 9: Delivery Class Diagram: A UML class diagram representing the process deliveries and their associated classes.
Figure 10: Activity Class Diagram: A UML class diagram representing the phases' activities and their considered issues and relevant classes.
Figure 11: Supplements Class Diagram: A UML class diagram showing the process supplements and their associated classes.
Figure 12: Process Execution Just Before a Feedback: A snapshot of the process execution just before performing a feedback in order to select a different modelling technique as captured in the Arabidopsis fingerprinting application. The performed phases are shown surrounded by a green light halo.
Figure 13: Process Execution Just After a Feedback: A snapshot of the process execution just after performing a feedback in order to select a different modelling technique. Captured in the Arabidopsis fingerprinting application.
Figure 14: Process Execution Before Process Iteration: A snapshot of the process execution just before performing a process iteration as captured in the cow diet application.
Figure 15: Process Execution after a Second Iteration: A snapshot of the process execution as captured in the cow diet application after the process was iterated in order to achieve new objectives. The process iteration is indicated by a light halo which surrounds the process.
Figure 16: Metabolomics Data: A snapshot of metabolomics data imported to the process as captured in the Arabidopsis isoprenoids profiling application.
Figure 17: Aims of a Metabolomics Study: A snapshot of the aims of a metabolomics study as captured in the Arabidopsis isoprenoids profiling application.
Figure 18: Phase Running: A snapshot of a phase running as captured in the Arabidopsis isoprenoids application.
Figure 19: Phase Prerequisites: A snapshot of a phase's prerequisites as captured in the Arabidopsis isoprenoids profiling application.
Figure 20: Phase Objectives: A snapshot of a phase's objectives as captured in the Arabidopsis isoprenoids profiling application.
Figure 21: Phase Planning: A snapshot of a phase's planning as captured in the Arabidopsis isoprenoids profiling application.
Figure 22: Phase Activity Performer: A snapshot of a performer, which was assigned to a phase activity in the Arabidopsis isoprenoids application.
Figure 23: Phase Performing Justification: A snapshot of a phase performing justification as captured in the Arabidopsis isoprenoids profiling application.
Figure 24: Phase Performing Problem: A snapshot of a phase performing problem as captured in the Arabidopsis isoprenoids profiling application.
Figure 25: Phase Validating: A snapshot of a phase activity validation as captured in the Arabidopsis isoprenoids profiling application.
Figure 26: Objectives Definition Phase Outcome: A snapshot of the objectives definition phase outcome as captured in the Arabidopsis isoprenoids profiling application.
Figure 27: Technique Selection Phase Outcome: A snapshot of the technique selection phase outcome as captured in the Arabidopsis isoprenoids profiling application.
Figure 28: Process Quality Assurance and Standards: A snapshot of the process quality assurance and standards as captured in the Arabidopsis isoprenoids profiling application.
Figure 29: The Assignment of Process Standards: A snapshot of the assignment of standards when delivering the model in the Arabidopsis isoprenoids application.
Figure 30: Process Human Interaction: A snapshot of the process human interaction as captured in the Arabidopsis isoprenoids profiling application.
Figure 31: Process Resources Management: A snapshot of the process management resources as captured in the Arabidopsis isoprenoids profiling application.
Figure 32: Process Management Constraints: A snapshot of the process management constraints as captured in the Arabidopsis isoprenoids profiling application.
Figure 33: Process Traceability: A snapshot of the process traceability information as captured in the Arabidopsis isoprenoids profiling application.
Figure 34: Process Customisation: A snapshot of the process customisation facilities (a) Phase customisation (b) Customisation of the process practical supplements (c) Traceability customisation.
Original abstract

This work presents MeKDDaM-SAGA, computer-aided automation software for implementing a novel knowledge discovery and data mining process model that was designed for performing justifiable, traceable and reproducible metabolomics data analysis. The process model focuses on achieving metabolomics analytical objectives and on considering the nature of its involved data. MeKDDaM-SAGA was successfully used for guiding the process model execution in a number of metabolomics applications. It satisfies the requirements of the proposed process model design and execution. The software realises the process model layout, structure and flow and it enables its execution externally using various data mining and machine learning tools or internally using a number of embedded facilities that were built for performing a number of automated activities such as data preprocessing, data exploration, data acclimatization, modelling, evaluation and visualization. MeKDDaM-SAGA was developed using object-oriented software engineering methodology and was constructed in Java. It consists of 241 design classes that were designed to implement 27 use-cases. The software uses an XML database to guarantee portability and uses a GUI interface to ensure its user-friendliness. It implements an internal embedded version control system that is used to realise and manage the process flow, feedback and iterations and to enable undoing and redoing the execution of the process phases, activities, and the internal tasks within its phases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents MeKDDaM-SAGA, a Java-based software system with 241 design classes implementing 27 use-cases, that automates a novel KDD process model for metabolomics. The model emphasizes justifiable, traceable, and reproducible analysis by focusing on analytical objectives and data characteristics. The software supports external DM/ML tools or internal facilities for preprocessing, exploration, acclimatization, modeling, evaluation, and visualization; it uses an XML database for portability and an embedded VCS to manage process flow, feedback, iterations, and undo/redo operations. The central claims are that the software was successfully used to guide the process model in multiple metabolomics applications and that it satisfies the model's design and execution requirements.

Significance. If the claims of successful use and requirement satisfaction were supported by evidence, the work would provide a specialized, traceable automation framework for metabolomics KDD that could improve reproducibility. The object-oriented design, embedded VCS for iteration management, and dual internal/external execution options represent concrete engineering strengths. However, the current absence of any reported applications, metrics, or validation data prevents assessment of actual impact or adoption potential.

major comments (2)
  1. [Abstract] The assertions that 'MeKDDaM-SAGA was successfully used for guiding the process model execution in a number of metabolomics applications' and that it 'satisfies the requirements of the proposed process model design and execution' are presented without any case-study details, datasets, quantitative success criteria, error analysis, reproducibility measures, or comparisons to prior KDD models/tools. This directly undermines evaluation of the central claim.
  2. [The manuscript, architecture and implementation sections] The description of 241 classes, 27 use-cases, internal facilities, and embedded VCS is given at a high level, but no validation is supplied that the implementation correctly realizes the process model phases/activities without introducing errors or limitations. Such validation is load-bearing for the claim of requirement satisfaction.
minor comments (2)
  1. [Abstract] The term 'data acclimatization' is introduced without definition or linkage to specific activities in the process model.
  2. [The manuscript] No diagram or table is referenced that maps the 27 use-cases or 241 classes to the process model phases; such a mapping would improve the clarity of the implementation claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We respond point-by-point to the major comments below, acknowledging the manuscript's limitations where evidence is absent.

Point-by-point responses
  1. Referee: [Abstract] The assertions that 'MeKDDaM-SAGA was successfully used for guiding the process model execution in a number of metabolomics applications' and that it 'satisfies the requirements of the proposed process model design and execution' are presented without any case-study details, datasets, quantitative success criteria, error analysis, reproducibility measures, or comparisons to prior KDD models/tools. This directly undermines evaluation of the central claim.

    Authors: We agree that the abstract asserts successful use in applications and satisfaction of requirements without any supporting case studies, datasets, metrics, or comparisons in the manuscript. The provided text describes the software's design (241 classes, 27 use-cases, XML database, embedded VCS) but supplies no empirical evidence for these claims. This is a substantive weakness. We will revise the abstract to remove the unsubstantiated assertions of 'successful use' and instead describe the software as implementing the process model, with any application references qualified or removed. Detailed validation belongs in companion papers. revision: yes

  2. Referee: [The manuscript, architecture and implementation sections] The description of 241 classes, 27 use-cases, internal facilities, and embedded VCS is given at a high level, but no validation is supplied that the implementation correctly realizes the process model phases/activities without introducing errors or limitations. Such validation is load-bearing for the claim of requirement satisfaction.

    Authors: The referee correctly identifies that the architecture and implementation sections offer only a high-level description without any validation (e.g., testing, verification of phase fidelity, or error analysis) that the 241 classes correctly realize the process model. No such evidence appears in the manuscript. We will revise by adding an explicit statement acknowledging the lack of formal validation as a limitation and describing the object-oriented methodology used during construction. Comprehensive validation data are not available for inclusion at this stage. revision: partial

standing simulated objections not resolved
  • Providing the missing case-study details, datasets, quantitative success criteria, error analysis, reproducibility measures, or implementation validation results, as none of these are documented in the current manuscript.

Circularity Check

0 steps flagged

No circularity; software description with no derivations or self-referential logic

full rationale

The paper is a description of custom software (MeKDDaM-SAGA) implementing a claimed novel KDD process model for metabolomics. No equations, fitted parameters, predictions, or derivation chains exist. Claims of successful use and requirement satisfaction are direct assertions without reduction to prior self-citations or internal definitions. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear. The central claims rest on implementation statements rather than any self-referential construction, making the paper self-contained against the circularity criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical content, free parameters, axioms, or invented entities are present; the paper is a description of software engineering artifacts for a data analysis workflow.


