pith. sign in

arxiv: 1906.08110 · v1 · pith:BKQHXFJTnew · submitted 2019-06-19 · 📊 stat.CO · stat.ME

PLS Generalized Linear Regression and Kernel Multilogit Algorithm (KMA) for Microarray Data Classification

Pith reviewed 2026-05-25 19:58 UTC · model grok-4.3

classification 📊 stat.CO stat.ME
keywords microarray classificationkernel multilogit algorithmpartial least squares generalized linear regressionlogistic regressionlinear discriminant analysisclassification error ratehigh-dimensional data
0
0 comments X

The pith

Kernel multilogit algorithm records the lowest error rates on microarray data among tested classifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper combines partial least squares generalized linear regression with logistic regression and with linear discriminant analysis to produce two new classifiers. It then pits these, plus the kernel multilogit algorithm, against KNN, LDA, PLSDA, RPLS and SVM on microarray data. The reported result is that the kernel multilogit algorithm returns the smallest classification error rates whether the data are left raw or preprocessed. Readers care because microarray classification supports medical and biological decisions that depend on accurate separation of high-dimensional gene profiles.

Core claim

When the kernel multilogit algorithm is applied to microarray data, it produces lower classification error rates than the partial least squares generalized linear regression-logistic regression model, the partial least squares generalized linear regression-linear discriminant analysis model, and the classical methods k-nearest neighbours, linear discriminant analysis, partial least squares discriminant analysis, ridge partial least squares and support vector machines, for both un-preprocessed and preprocessed versions of the data.

What carries the argument

The kernel multilogit algorithm (KMA), whose performance is measured by classification error rate against the listed competitors on microarray inputs.

If this is right

  • KMA should be the default choice when the priority is minimizing misclassification on microarray profiles.
  • The two PLSGLR extensions remain competitive with but do not surpass KMA or the strongest classical methods.
  • Preprocessing the data does not reverse the observed ranking of the methods.
  • SVM, LDA and PLSDA are outperformed by KMA on the tested inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the advantage of KMA stems from its kernel handling of the high-dimensional geometry, similar gains could appear on other gene-expression or proteomics data sets.
  • Standardized cross-validation or nested tuning would be needed to confirm that the reported ordering survives equal computational budgets.
  • The comparison leaves open whether KMA retains its edge when the number of classes or the sample size changes substantially.

Load-bearing premise

The performance ordering among the methods is not produced by the particular data sets, the chosen preprocessing steps, or unequal amounts of tuning effort across the classifiers.

What would settle it

Apply all the same methods to a fresh collection of microarray data sets using identical, documented tuning protocols and check whether KMA still records the lowest error rates.

Figures

Figures reproduced from arXiv: 1906.08110 by Adolphus Wagala, Graciela Gonzalez-Far{\i}as, Oscar Dalmau, Rogelio Ramos.

Figure 1
Figure 1. Figure 1 [PITH_FULL_IMAGE:figures/full_fig_p011_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Box plot for the preprocessed colon data. This plot presents less variations. The data seem to have a symmetric distribution and do not show the presence of unwanted variation. From the two figures, it is expected that the preprocessed data would be easier to analyze. 11 [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: RLE plots for the un-preprocessed and preprocessed colon data. The RLE plot for the un-preprocessed data shows the presence of a lot of heterogeneity, implying that the data have variations that do not necessarily come from biological factors. However, the RLE plot for the processed data shows homogeneity and lack of unwanted noise, and should give better results when analyzed statistically. 13 [PITH_FULL… view at source ↗
Figure 4
Figure 4. Figure 4: PCA plot for the un-preprocessed Colon data. The PCA plots show that it is harder to separate/classify the un-preprocessed data. 1 2 43 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 −0.2 −0.1 0.0 0.1 0.2 −0.2 −0.1 0.0 0.1 0.2 PC1 PC2 status a a Normal Tumor PCA plot for preprocessed co… view at source ↗
Figure 5
Figure 5. Figure 5: PCA plots for the preprocessed Colon data. It is relatively easier to sepa￾rate/classify preprocessed data. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
read the original abstract

We implement extensions of the partial least squares generalized linear regression (PLSGLR) due to Bastien et al. (2005) through its combination with logistic regression and linear discriminant analysis, to get a partial least squares generalized linear regression-logistic regression model (PLSGLR-log), and a partial least squares generalized linear regression-linear discriminant analysis model (PLSGLRDA). These two classification methods are then compared with classical methodologies like the k-nearest neighbours (KNN), linear discriminant analysis (LDA), partial least squares discriminant analysis (PLSDA), ridge partial least squares (RPLS), and support vector machines(SVM). Furthermore, we implement the kernel multilogit algorithm (KMA) by Dalmau et al. (2015)and compare its performance with that of the other classifiers. The results indicate that for both un-preprocessed and preprocessed data, the KMA has the lowest classification error rates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript extends partial least squares generalized linear regression (PLSGLR) by combining it with logistic regression to form PLSGLR-log and with linear discriminant analysis to form PLSGLRDA. These are compared, together with the kernel multilogit algorithm (KMA), against KNN, LDA, PLSDA, RPLS and SVM on microarray classification tasks; the central claim is that KMA attains the lowest classification error rates on both un-preprocessed and preprocessed data.

Significance. If the reported ordering is shown to be robust under equal hyper-parameter budgets, reproducible cross-validation and multiple independent data partitions, the work would supply a practical kernel-based multilogit classifier for high-dimensional microarray problems and useful PLSGLR extensions for the same domain.

major comments (1)
  1. [Abstract / Results] Abstract and Results section: the claim that KMA has the lowest error rates supplies no information on the number or identity of the microarray data sets, their sample sizes, the cross-validation protocol, the hyper-parameter search procedure applied to each comparator, or any statistical test of the observed error-rate differences. Without these details the performance ranking cannot be evaluated and remains an artifact risk.
minor comments (1)
  1. [Abstract] Abstract: missing space before parenthesis in 'support vector machines(SVM)' and before 'and' in 'Dalmau et al. (2015)and'.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The single major comment correctly identifies missing experimental details that are necessary to substantiate the performance claims. We will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results section: the claim that KMA has the lowest error rates supplies no information on the number or identity of the microarray data sets, their sample sizes, the cross-validation protocol, the hyper-parameter search procedure applied to each comparator, or any statistical test of the observed error-rate differences. Without these details the performance ranking cannot be evaluated and remains an artifact risk.

    Authors: We agree that the abstract and results section as currently written do not supply the requested experimental details. In the revised version we will (i) state the exact number and identities of the microarray datasets together with their sample sizes, (ii) describe the cross-validation protocol (including number of folds and repetitions), (iii) document the hyper-parameter search ranges and selection procedure applied uniformly to all comparators, and (iv) report the results of appropriate statistical tests (paired Wilcoxon or t-tests across data partitions) for the observed error-rate differences. These additions will be placed in both the abstract and the results section so that the ranking can be properly evaluated. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical error-rate ranking on held-out microarray data is self-contained

full rationale

The paper's central claim is an observed ordering of test-set classification error rates across several methods (KNN, LDA, PLSDA, RPLS, SVM, PLSGLR-log, PLSGLRDA, KMA) on microarray datasets, both raw and preprocessed. This ordering is produced by training each classifier on a training partition and evaluating misclassification on a separate test partition; no equation or performance metric is shown to equal a fitted parameter or a quantity defined inside the paper itself. The single self-citation to Dalmau et al. (2015) for the KMA implementation is not load-bearing for the empirical ranking, which rests on direct computation rather than any derivation that collapses to its inputs. The result is therefore falsifiable by re-running the same protocol on new data splits and does not reduce by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.0 · 5708 in / 1032 out tokens · 25413 ms · 2026-05-25T19:58:04.554285+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

  1. [1]

    Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., and Levine, A. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96:6745--6750

  2. [2]

    Alshamlan, H., Badr, G., and Alohali, Y. 2013. A study of cancer microarray gene expression profile: Objectives and approaches. In Proceedings of the World Congress on Engineering, London, U.K., volume II

  3. [3]

    Awada, W., Khoshgoftaar, T., Dittman, D., Wald, R., and Napolitano, A. 2012. Information Reuse and Integration (IRI), 2012 IEEE 13th International Conference on . In A Review of the Stability of Feature Selection Techniques for Bioinformatics Data, pp. 356--363

  4. [4]

    V., and Tenenhaus, M

    Bastien, P., Vinzi, E. V., and Tenenhaus, M. 2005. PLS generalised linear regression. Computational Statistics and Data Analysis 48:17--46

  5. [5]

    and Keles, S

    Chun, H. and Keles, S. 2009. Sparse partial least squares regression for simultaneous dimension reduction and variable selection . Journal of the Royal Statistical Society. Series B, Statistical Methodology 72:3–25

  6. [6]

    E., and Gonz\'alez, G

    Dalmau, O., Alarc\'on, T. E., and Gonz\'alez, G. 2015. Kernel multilogit algorithm for multiclass classification. Computational Statistics and Data Analysis 82:199--206

  7. [7]

    Dong, K., Zhang, F., Zhu, Z., Wang, Z., and Wang, G. 2014. Partial least squares based gene expression analysis in posttraumatic stress disorder. European Review for Medical and Pharmacological Sciences 18:2306--2310

  8. [8]

    Dudoit, S., Fridlyand, J., and Speed, T. 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97:77--86

  9. [9]

    Gagnon-Bartsch, J. A. and Speed, T. 2011. Using control genes to correct for unwanted variation in microarray data . Biostatistics (Oxford, England) 13:539--552

  10. [10]

    Gromski, S., Muhamadali, H., Ellis, D., Xu, Y., Correa, E., Turner, M., and Goodcare, R. 2015. A tutorial review: Metabolomics and partial least squares-discriminant analysis – a marriage of convenience or a shotgun wedding. Analytica Chimica Acta 879:10--23

  11. [11]

    Gusnanto, A., Ploner, A., Shuweihdi, F., and Pawitan, Y. 2013. Partial least squares and logistic regression random-effects estimates for gene selection in supervised classification of gene expression data. Journal of Biomedical Informatics pp. 697--709

  12. [12]

    H \"o skuldsson, A. 1988. PLS regression methods. Journal of Chemometrics 2:211--228

  13. [13]

    Huang, C., Tu, S., Huang, C., Lien, H., Lai, L., and Chuang, E. 2013. Multiclass prediction with partial least square regression for gene expression data: Applications in breast cancer intrinsic taxonomy. BioMed Research International pp. 1--9

  14. [14]

    L\^ e Cao, K., Rossouw, D., Robert-Granie\' e , C., and Besse, P. 2008. A Sparse PLS for variable selection when integrating omics data. Statistical Applications in Genetics and Molecular Biology 7(1), Article 35

  15. [15]

    Lee, D., Lee, W., Lee, Y., and Pawitan, Y. 2011. Sparse partial least-squares regression and its applications to high-throughput data analysis. Chemometrics and Intelligent Laboratory Systems 109:1 -- 8

  16. [16]

    and Rocke, D

    Nguyen, D. and Rocke, D. M. 2002a. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics 18:1216--1226

  17. [17]

    and Rocke, D

    Nguyen, D. and Rocke, D. M. 2002b. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18:39--50

  18. [18]

    Telaar, A., Liland, K., Repsilber, D., and N \"u rnberg, G. 2013. An extension of PPLS-DA for classification and comparison to ordinary PLS-DA . PLoS ONE 8 2:e55267

  19. [19]

    Wang, A., An, N., Chen, G., Li, L., and Alterovitz, G. 2015. Improving pls–rfe based gene selection for microarray data classification. Computers in Biology and Medicine 62:14--24

  20. [20]

    Wold, S., Ruhe, A., Wold, W., and Dunn III, W. J. 1984. The collinearity problem in linear regression, the partial least squares approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing 5:735--743

  21. [21]

    Wold, S., Sj \"o str \"o m, M., and Erikson, L. 2001. PLS -regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems 58:109--130

  22. [22]

    Xi, B., Gu, H., Baniasadi, H., and Raftery, D. 2014. Statistical analysis and modeling of mass spectrometry-based metabolomics data. Methods Mol Biol. 1198:333--353