PLS Generalized Linear Regression and Kernel Multilogit Algorithm (KMA) for Microarray Data Classification

Adolphus Wagala; Graciela Gonzalez-Farías; Oscar Dalmau; Rogelio Ramos

arxiv: 1906.08110 · v1 · pith:BKQHXFJTnew · submitted 2019-06-19 · 📊 stat.CO · stat.ME

Pith reviewed 2026-05-25 19:58 UTC · model grok-4.3

keywords: microarray classification; kernel multilogit algorithm; partial least squares generalized linear regression; logistic regression; linear discriminant analysis; classification error rate; high-dimensional data

The pith

The kernel multilogit algorithm records the lowest error rates on microarray data among the tested classifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper combines partial least squares generalized linear regression with logistic regression and with linear discriminant analysis to produce two new classifiers. It then pits these, plus the kernel multilogit algorithm, against KNN, LDA, PLSDA, RPLS and SVM on microarray data. The reported result is that the kernel multilogit algorithm returns the smallest classification error rates whether the data are left raw or preprocessed. Readers care because microarray classification supports medical and biological decisions that depend on accurate separation of high-dimensional gene profiles.

Core claim

When the kernel multilogit algorithm is applied to microarray data, it produces lower classification error rates than the partial least squares generalized linear regression-logistic regression model, the partial least squares generalized linear regression-linear discriminant analysis model, and the classical methods k-nearest neighbours, linear discriminant analysis, partial least squares discriminant analysis, ridge partial least squares and support vector machines, for both un-preprocessed and preprocessed versions of the data.

What carries the argument

The kernel multilogit algorithm (KMA), whose performance is measured by classification error rate against the listed competitors on microarray inputs.

If this is right

KMA should be the default choice when the priority is minimizing misclassification on microarray profiles.
The two PLSGLR extensions remain competitive with but do not surpass KMA or the strongest classical methods.
Preprocessing the data does not reverse the observed ranking of the methods.
SVM, LDA and PLSDA are outperformed by KMA on the tested inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the advantage of KMA stems from its kernel handling of the high-dimensional geometry, similar gains could appear on other gene-expression or proteomics data sets.
Standardized cross-validation or nested tuning would be needed to confirm that the reported ordering survives equal computational budgets.
The comparison leaves open whether KMA retains its edge when the number of classes or the sample size changes substantially.

Load-bearing premise

The performance ordering among the methods is not produced by the particular data sets, the chosen preprocessing steps, or unequal amounts of tuning effort across the classifiers.

What would settle it

Apply all the same methods to a fresh collection of microarray data sets using identical, documented tuning protocols and check whether KMA still records the lowest error rates.
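One ingredient of such a protocol is that every classifier is scored on exactly the same data partitions. The sketch below shows one way to fix stratified folds once and reuse them for all methods. The function name `stratified_folds`, the seed, the fold count and the toy labels are illustrative assumptions, not details taken from the paper.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with roughly equal class proportions.

    Hypothetical helper: the paper's actual partitioning scheme is not stated.
    """
    rng = random.Random(seed)  # fixed seed so every method sees the same folds
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return [sorted(fold) for fold in folds]


# Toy labels (assumed for illustration only).
toy_labels = ["tumor"] * 12 + ["normal"] * 8
toy_folds = stratified_folds(toy_labels, k=4, seed=1)
```

Each fold in `toy_folds` would serve once as the test set while the remaining folds form the training set, and the same list would be passed to every classifier under comparison.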

Figures

Figures reproduced from arXiv: 1906.08110 by Adolphus Wagala, Graciela Gonzalez-Farías, Oscar Dalmau, Rogelio Ramos.

**Figure 2.** Box plot for the preprocessed colon data. This plot presents less variations. The data seem to have a symmetric distribution and do not show the presence of unwanted variation. From the two figures, it is expected that the preprocessed data would be easier to analyze.

**Figure 3.** RLE plots for the un-preprocessed and preprocessed colon data. The RLE plot for the un-preprocessed data shows the presence of a lot of heterogeneity, implying that the data have variations that do not necessarily come from biological factors. However, the RLE plot for the processed data shows homogeneity and lack of unwanted noise, and should give better results when analyzed statistically.

**Figure 4.** PCA plot for the un-preprocessed Colon data. The PCA plots show that it is harder to separate/classify the un-preprocessed data. (Axes: PC1, PC2; points labelled by sample number 1 to 62; status: Normal, Tumor.)

**Figure 5.** PCA plots for the preprocessed Colon data. It is relatively easier to separate/classify preprocessed data.
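The relative log expression (RLE) diagnostic mentioned in the Figure 3 caption can be sketched as follows: for each gene, the median log expression across arrays is subtracted, and the resulting deviations are then summarised per array. Arrays free of unwanted variation tend to have deviations centred near zero. The function names and the toy matrix below are assumptions made for illustration; the paper's own preprocessing code is not shown on this page.

```python
from statistics import median


def rle(log_expr):
    """Subtract each gene's median (taken across arrays) from its log expression.

    log_expr: list of genes, each a list with one log-expression value per array.
    """
    centred = []
    for gene_values in log_expr:
        gene_median = median(gene_values)
        centred.append([value - gene_median for value in gene_values])
    return centred


def sample_medians(rle_matrix):
    """Median RLE value of each array (one box per array in an RLE plot)."""
    n_arrays = len(rle_matrix[0])
    return [median(row[j] for row in rle_matrix) for j in range(n_arrays)]


# Toy matrix: 3 genes (rows) by 3 arrays (columns), values assumed for illustration.
toy_log_expr = [
    [1.0, 2.0, 3.0],
    [4.0, 4.0, 7.0],
    [0.0, 1.0, 5.0],
]
toy_medians = sample_medians(rle(toy_log_expr))
```

In this toy matrix the third array has consistently higher values than the others, so its RLE median sits well above zero, which is the kind of heterogeneity the caption attributes to the un-preprocessed data.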

Original abstract

We implement extensions of the partial least squares generalized linear regression (PLSGLR) due to Bastien et al. (2005) through its combination with logistic regression and linear discriminant analysis, to get a partial least squares generalized linear regression-logistic regression model (PLSGLR-log), and a partial least squares generalized linear regression-linear discriminant analysis model (PLSGLRDA). These two classification methods are then compared with classical methodologies like the k-nearest neighbours (KNN), linear discriminant analysis (LDA), partial least squares discriminant analysis (PLSDA), ridge partial least squares (RPLS), and support vector machines(SVM). Furthermore, we implement the kernel multilogit algorithm (KMA) by Dalmau et al. (2015)and compare its performance with that of the other classifiers. The results indicate that for both un-preprocessed and preprocessed data, the KMA has the lowest classification error rates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper extends PLSGLR in two obvious ways and runs a classifier comparison on microarray data, but the abstract supplies zero information on datasets, tuning, or validation, so the KMA-wins claim cannot be assessed.

The core of this work is two incremental extensions of Bastien et al.'s PLSGLR (pairing it with logistic regression and with LDA), plus a head-to-head test against KNN, LDA, PLSDA, RPLS, SVM, and the 2015 KMA algorithm. The authors conclude that KMA produces the lowest error rates on both raw and preprocessed microarray data. That is the entire contribution: no new theory, no first-principles derivation, just an application and ranking study built from previously published pieces.

The only thing that could make it useful is a careful, reproducible empirical comparison, and that is exactly what is missing from the abstract. There is no mention of how many datasets were used, their dimensions or sample sizes, the cross-validation protocol, the hyperparameter search procedure for any method, or whether error-rate differences were tested for significance. Without those details the reported ordering is impossible to interpret; it could reflect unequal tuning effort, data-specific quirks, or preprocessing choices that favor the kernel method. The stress-test note correctly flags this gap.

If the full manuscript contains a complete, documented experimental section with multiple independent partitions and equal tuning budgets, then the paper becomes a narrow but potentially citable application note for people already working on microarray classification. As it stands, the central claim rests on an unsupported ranking. This is not the kind of work I would bring to a reading group or cite in the next year. It does not rise to the level that justifies referee time; an editor should ask for the missing experimental protocol before sending it out.

Referee Report

1 major / 1 minor

Summary. The manuscript extends partial least squares generalized linear regression (PLSGLR) by combining it with logistic regression to form PLSGLR-log and with linear discriminant analysis to form PLSGLRDA. These are compared, together with the kernel multilogit algorithm (KMA), against KNN, LDA, PLSDA, RPLS and SVM on microarray classification tasks; the central claim is that KMA attains the lowest classification error rates on both un-preprocessed and preprocessed data.

Significance. If the reported ordering is shown to be robust under equal hyper-parameter budgets, reproducible cross-validation and multiple independent data partitions, the work would supply a practical kernel-based multilogit classifier for high-dimensional microarray problems and useful PLSGLR extensions for the same domain.

major comments (1)

[Abstract / Results] The claim that KMA has the lowest error rates comes with no information on the number or identity of the microarray data sets, their sample sizes, the cross-validation protocol, the hyper-parameter search procedure applied to each comparator, or any statistical test of the observed error-rate differences. Without these details the performance ranking cannot be evaluated and may be an artifact of the experimental setup.

minor comments (1)

[Abstract] Abstract: missing space before parenthesis in 'support vector machines(SVM)' and before 'and' in 'Dalmau et al. (2015)and'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive report. The single major comment correctly identifies missing experimental details that are necessary to substantiate the performance claims. We will revise the manuscript accordingly.

Referee: [Abstract / Results] The claim that KMA has the lowest error rates comes with no information on the number or identity of the microarray data sets, their sample sizes, the cross-validation protocol, the hyper-parameter search procedure applied to each comparator, or any statistical test of the observed error-rate differences. Without these details the performance ranking cannot be evaluated and may be an artifact of the experimental setup.

Authors: We agree that the abstract and results section as currently written do not supply the requested experimental details. In the revised version we will (i) state the exact number and identities of the microarray datasets together with their sample sizes, (ii) describe the cross-validation protocol (including number of folds and repetitions), (iii) document the hyper-parameter search ranges and selection procedure applied uniformly to all comparators, and (iv) report the results of appropriate statistical tests (paired Wilcoxon or t-tests across data partitions) for the observed error-rate differences. These additions will be placed in both the abstract and the results section so that the ranking can be properly evaluated.

Revision: yes
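As a rough illustration of item (iv), a paired t statistic over per-partition error rates could be computed along the following lines. The error rates below are made-up numbers, and `paired_t_statistic` is a hypothetical helper rather than anything from the manuscript. Only the statistic and its degrees of freedom are returned; a p-value would still have to be read from a t table or obtained from a statistics library.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic for two classifiers scored on the same partitions.

    Returns (t, degrees_of_freedom). Positive t means method A has the
    higher mean error rate.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both methods must be scored on the same partitions")
    differences = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(differences)
    spread = stdev(differences)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(differences) / (spread / sqrt(n)), n - 1


# Hypothetical error rates on four shared partitions (not results from the paper).
svm_errors = [0.10, 0.20, 0.15, 0.25]
kma_errors = [0.05, 0.10, 0.15, 0.10]
t_value, dof = paired_t_statistic(svm_errors, kma_errors)
```

With only a handful of partitions the normality assumption behind the t-test is hard to check, which is presumably why the rebuttal also names the paired Wilcoxon test as an alternative.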

Circularity Check

0 steps flagged

No circularity: empirical error-rate ranking on held-out microarray data is self-contained

The paper's central claim is an observed ordering of test-set classification error rates across several methods (KNN, LDA, PLSDA, RPLS, SVM, PLSGLR-log, PLSGLRDA, KMA) on microarray datasets, both raw and preprocessed. This ordering is produced by training each classifier on a training partition and evaluating misclassification on a separate test partition; no equation or performance metric is shown to equal a fitted parameter or a quantity defined inside the paper itself. The single self-citation to Dalmau et al. (2015) for the KMA implementation is not load-bearing for the empirical ranking, which rests on direct computation rather than any derivation that collapses to its inputs. The result is therefore falsifiable by re-running the same protocol on new data splits and does not reduce to its inputs by construction.
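The train-then-test protocol described here reduces to a simple computation: fit on one partition, predict on the other, and count the disagreements. The sketch below uses a 1-nearest-neighbour rule as a stand-in classifier. KNN is among the paper's comparators, but the choice k = 1, the squared Euclidean distance and the toy data are assumptions made purely for illustration.

```python
def error_rate(true_labels, predicted_labels):
    """Fraction of test samples whose predicted label differs from the truth."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


def one_nn_predict(train_x, train_y, test_x):
    """Label each test point with the label of its nearest training point."""
    predictions = []
    for point in test_x:
        # Squared Euclidean distance to every training sample.
        distances = [
            sum((a - b) ** 2 for a, b in zip(point, other)) for other in train_x
        ]
        predictions.append(train_y[distances.index(min(distances))])
    return predictions


# Toy two-feature data (assumed); real microarray profiles have thousands of genes.
train_x = [[0, 0], [0, 1], [5, 5], [6, 5]]
train_y = ["normal", "normal", "tumor", "tumor"]
test_x = [[1, 0], [5, 6], [3, 3]]
test_y = ["normal", "tumor", "normal"]

holdout_error = error_rate(test_y, one_nn_predict(train_x, train_y, test_x))
```

The error rate is computed only from samples the classifier never saw during fitting, which is what makes the ranking falsifiable on new splits.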

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or newly postulated entities.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1] Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., and Levine, A. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96:6745–6750

[2] Alshamlan, H., Badr, G., and Alohali, Y. 2013. A study of cancer microarray gene expression profile: Objectives and approaches. In Proceedings of the World Congress on Engineering, London, U.K., volume II

[3] Awada, W., Khoshgoftaar, T., Dittman, D., Wald, R., and Napolitano, A. 2012. A Review of the Stability of Feature Selection Techniques for Bioinformatics Data. In Information Reuse and Integration (IRI), 2012 IEEE 13th International Conference on, pp. 356–363

[4] Bastien, P., Vinzi, E. V., and Tenenhaus, M. 2005. PLS generalised linear regression. Computational Statistics and Data Analysis 48:17–46

[5] Chun, H. and Keles, S. 2009. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society. Series B, Statistical Methodology 72:3–25

[6] Dalmau, O., Alarcón, T. E., and González, G. 2015. Kernel multilogit algorithm for multiclass classification. Computational Statistics and Data Analysis 82:199–206

[7] Dong, K., Zhang, F., Zhu, Z., Wang, Z., and Wang, G. 2014. Partial least squares based gene expression analysis in posttraumatic stress disorder. European Review for Medical and Pharmacological Sciences 18:2306–2310

[8] Dudoit, S., Fridlyand, J., and Speed, T. 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97:77–86

[9] Gagnon-Bartsch, J. A. and Speed, T. 2011. Using control genes to correct for unwanted variation in microarray data. Biostatistics (Oxford, England) 13:539–552

[10] Gromski, S., Muhamadali, H., Ellis, D., Xu, Y., Correa, E., Turner, M., and Goodcare, R. 2015. A tutorial review: Metabolomics and partial least squares-discriminant analysis – a marriage of convenience or a shotgun wedding. Analytica Chimica Acta 879:10–23

[11] Gusnanto, A., Ploner, A., Shuweihdi, F., and Pawitan, Y. 2013. Partial least squares and logistic regression random-effects estimates for gene selection in supervised classification of gene expression data. Journal of Biomedical Informatics pp. 697–709

[12] Höskuldsson, A. 1988. PLS regression methods. Journal of Chemometrics 2:211–228

[13] Huang, C., Tu, S., Huang, C., Lien, H., Lai, L., and Chuang, E. 2013. Multiclass prediction with partial least square regression for gene expression data: Applications in breast cancer intrinsic taxonomy. BioMed Research International pp. 1–9

[14] Lê Cao, K., Rossouw, D., Robert-Granié, C., and Besse, P. 2008. A Sparse PLS for variable selection when integrating omics data. Statistical Applications in Genetics and Molecular Biology 7(1), Article 35

[15] Lee, D., Lee, W., Lee, Y., and Pawitan, Y. 2011. Sparse partial least-squares regression and its applications to high-throughput data analysis. Chemometrics and Intelligent Laboratory Systems 109:1–8

[16] Nguyen, D. and Rocke, D. M. 2002a. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics 18:1216–1226

[17] Nguyen, D. and Rocke, D. M. 2002b. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18:39–50

[18] Telaar, A., Liland, K., Repsilber, D., and Nürnberg, G. 2013. An extension of PPLS-DA for classification and comparison to ordinary PLS-DA. PLoS ONE 8(2):e55267

[19] Wang, A., An, N., Chen, G., Li, L., and Alterovitz, G. 2015. Improving pls–rfe based gene selection for microarray data classification. Computers in Biology and Medicine 62:14–24

[20] Wold, S., Ruhe, A., Wold, W., and Dunn III, W. J. 1984. The collinearity problem in linear regression, the partial least squares approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing 5:735–743

[21] Wold, S., Sjöström, M., and Erikson, L. 2001. PLS-regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems 58:109–130

[22] Xi, B., Gu, H., Baniasadi, H., and Raftery, D. 2014. Statistical analysis and modeling of mass spectrometry-based metabolomics data. Methods Mol Biol. 1198:333–353
