Multi-Modal Machine Learning for Population- and Subject-Specific lncRNA-Type 2 Diabetes Association Analysis

Ashwani Siwach; Sanjeev Narayan Sharma; Sunil Datt Sharma

arxiv: 2605.20747 · v1 · pith:JWTGQBOAnew · submitted 2026-05-20 · 🧬 q-bio.GN · cs.LG

Pith reviewed 2026-05-21 02:26 UTC · model grok-4.3

classification 🧬 q-bio.GN cs.LG

keywords lncRNA · Type 2 diabetes · Machine learning · Multi-feature analysis · SHAP · Cohort study · RNA-seq · Precision medicine

The pith

Machine learning on multi-feature lncRNA data reveals type 2 diabetes associations that vary by cohort but share a dominant player.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop an integrative framework that pulls expression, secondary structure, and sequence features from ten literature-reported lncRNAs and feeds them into eight machine learning classifiers. They apply stratified cross-validation, leave-one-out, and repeated hold-out schemes to two independent RNA-seq cohorts, then use SHAP to produce subject-level interpretations. The work finds distinct feature associations in each cohort while identifying MEG3 as the dominant lncRNA across both, and shows that these machine learning results align with but extend beyond standard statistical tests. A reader would care because the approach supplies population-specific and individual-level molecular profiles that could guide more targeted study of T2D mechanisms.
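The three validation schemes named above differ only in how subjects are divided into training and test sets. The following is a minimal sketch of each scheme as an index generator, written with the Python standard library only. It is not the authors' code: the function names, the toy cohort of 12 controls and 8 T2D subjects, the fold count, and the test fraction are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def _indices_by_class(labels):
    """Group subject indices by class label."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    return by_class


def stratified_k_fold(labels, k, seed=0):
    """Yield (train, test) index lists with class proportions kept per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idx in _indices_by_class(labels).values():
        rng.shuffle(idx)
        # Deal each class round-robin so every fold gets a similar class mix.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def leave_one_out(n):
    """Yield n splits, each holding out exactly one subject."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def repeated_hold_out(labels, test_fraction, repeats, seed=0):
    """Yield `repeats` independent stratified random train/test splits."""
    rng = random.Random(seed)
    by_class = _indices_by_class(labels)
    for _ in range(repeats):
        test = []
        for idx in by_class.values():
            shuffled = idx[:]
            rng.shuffle(shuffled)
            n_test = max(1, round(len(shuffled) * test_fraction))
            test.extend(shuffled[:n_test])
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test)


# Toy cohort (hypothetical): 12 controls (0) and 8 T2D subjects (1).
labels = [0] * 12 + [1] * 8
folds = list(stratified_k_fold(labels, k=4))
loo_splits = list(leave_one_out(len(labels)))
hold_outs = list(repeated_hold_out(labels, test_fraction=0.25, repeats=10))
```

Stratification matters most in small cohorts, where an unstratified split can leave a test fold with no cases at all.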

Core claim

The paper extracts expression, secondary-structure, and sequence features from the ten lncRNAs and evaluates eight classifiers under multiple validation schemes on two independent cohorts. This yields cohort-specific associations: GAS5 and XIST expression features plus GAS5, MEG3, and ANRIL sequence features in the first cohort, and MALAT1 expression plus KCNQ1OT1, ANRIL, and MEG3 sequence features in the second. SHAP identifies MEG3 as the dominant lncRNA in both cohorts. The results remain consistent with established statistical methods while supplying population- and subject-level disease association profiles tied to specific molecular feature types.

What carries the argument

The integrative multi-feature framework that extracts expression, secondary-structure, and sequence data from each lncRNA and applies SHAP analysis to generate subject-level association interpretations.
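To make "subject-level" concrete, here is a minimal sketch of exact Shapley attribution for a single subject. It is not the SHAP library's estimator and not the authors' pipeline: the scoring function `toy_score`, the three feature names, the numeric values, and the use of a single baseline vector are all hypothetical, chosen only to show how one prediction is split into per-feature contributions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to a baseline vector.

    Features outside a coalition are replaced by their baseline value.
    Cost grows as 2**n_features, so this only suits very small feature sets.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(z):
    """Hypothetical risk score with one interaction term (illustration only)."""
    meg3_expr, gas5_expr, anril_gc = z
    return 0.8 * meg3_expr + 0.3 * gas5_expr + 0.5 * meg3_expr * anril_gc


subject = [2.0, 1.0, 0.6]    # made-up feature values for one subject
baseline = [1.0, 1.0, 0.5]   # made-up reference values (e.g. a cohort mean)
phi = shapley_values(toy_score, subject, baseline)
```

The contributions sum to the difference between the subject's score and the baseline score, and the GAS5 term is zero here because the subject's value equals the baseline. Which feature dominates therefore depends on the chosen baseline as well as on the model.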

Load-bearing premise

That the ten literature-reported lncRNAs and the chosen expression, secondary structure, and sequence feature types sufficiently represent the relevant biology for detecting T2D associations without critical missing variables or cohort-specific biases.

What would settle it

Finding no significant associations, or contradictory ones, between these lncRNAs and T2D when the same feature types are applied to a third independent cohort would indicate that the current selections do not generalize.

Original abstract

Long non-coding RNAs (lncRNAs) are emerging regulatory molecules implicated in chronic disease pathogenesis, including Type 2 Diabetes Mellitus (T2D). We investigated ten literature-reported lncRNAs associated with T2D: MALAT1, MEG3, MIAT, ANRIL, GAS5, KCNQ1OT1, H19, BCYRN1, XIST, and HOTAIR across two independent population-based RNA-seq cohorts. Single-omics approaches provide an incomplete view of disease biology; therefore, an integrative multi-feature framework was developed, extracting expression, secondary-structure, and sequence features for each lncRNA. Eight machine learning (ML) classifiers were evaluated under stratified k-fold, leave-one-out cross-validation (LOOCV), and repeated hold-out schemes to ensure robust performance estimation. SHAP analysis was applied for subject-level association interpretation. In one cohort, GAS5 and XIST expression features, along with GAS5, MEG3, and ANRIL sequence features, were found to be associated with T2D, while MALAT1 expression and KCNQ1OT1, ANRIL, and MEG3 sequence features were found to be associated in the second cohort. MEG3 was identified by SHAP as the dominant lncRNA in both cohorts. ML results were consistent with established statistical methods while additionally providing population- and subject-level disease association profiles linked to specific molecular feature types. The proposed framework advances mechanistic understanding of T2D and supports lncRNA-based precision medicine.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Applies standard multi-modal ML and SHAP to ten pre-selected lncRNAs across two T2D cohorts to surface some cohort-specific associations, with MEG3 prominent, but stays within a narrow literature-derived input set.

The key point is that the authors apply a suite of machine learning classifiers to expression, sequence, and secondary structure features from ten literature-reported lncRNAs in two T2D cohorts, using SHAP to interpret associations at the subject level. MEG3 stands out in both cohorts, with some differences in other lncRNAs between groups. They report GAS5 and XIST expression plus certain sequence features in one cohort, and MALAT1 expression plus others in the second, while noting consistency with statistical tests.

What stands out as useful is the multi-feature approach combined with several validation strategies like LOOCV and repeated hold-out. They also check consistency with statistical methods, which gives the results a bit more grounding. The framework for population- and subject-specific profiles is a reasonable way to extend prior association studies.

The soft spot is the pre-selection of only those ten lncRNAs. This makes the findings dependent on that initial choice from the literature, raising the chance of missing stronger signals or having bias if those candidates correlate with unmeasured confounders like batch effects or cell-type composition. The abstract is light on numbers for performance or sample sizes, so the full paper needs to show those clearly to support the claims.

Readers in bioinformatics or computational biology focused on non-coding RNAs in diabetes would get the most from this. It offers a template for similar integrative analyses.

The work shows clear thinking on the methods side and honest use of existing tools, so it deserves a serious referee to sort out the details. I'd recommend putting it through peer review rather than rejecting it outright, with notes on justifying the lncRNA list and reporting full metrics.

Referee Report

3 major / 2 minor

Summary. The manuscript develops a multi-modal ML framework to associate ten literature-preselected lncRNAs (MALAT1, MEG3, MIAT, ANRIL, GAS5, KCNQ1OT1, H19, BCYRN1, XIST, HOTAIR) with T2D. Expression, secondary-structure, and sequence features are extracted from two independent RNA-seq cohorts; eight classifiers are trained under stratified k-fold, LOOCV, and repeated hold-out validation; SHAP values identify cohort-specific feature associations (GAS5/XIST expression and GAS5/MEG3/ANRIL sequence in cohort 1; MALAT1 expression and KCNQ1OT1/ANRIL/MEG3 sequence in cohort 2) with MEG3 dominant in both, reported as consistent with conventional statistical tests and enabling population- and subject-level interpretation.

Significance. If the reported associations prove robust, the work would supply a concrete example of how multi-feature ML plus SHAP can move beyond population-level statistics to subject-specific lncRNA-T2D profiles, potentially informing mechanistic hypotheses and lncRNA-targeted precision-medicine strategies. The explicit comparison of three feature classes and the cross-cohort consistency of MEG3 are positive elements.

major comments (3)

[Abstract / Results] The manuscript states that ML results are 'consistent with established statistical methods' and reports specific feature associations, yet supplies no cohort sizes, model performance metrics (accuracy, AUC, F1), confidence intervals, or p-value thresholds. Without these quantities the strength of the claimed associations and the reliability of the SHAP rankings cannot be assessed.
[Methods] The input space is restricted a priori to ten literature-reported lncRNAs and three hand-crafted feature families. No genome-wide screen or sensitivity analysis is described; consequently the reported population- and subject-specific profiles are conditional on this narrow prior selection and could change if additional lncRNAs or unmeasured covariates (batch, cell-type composition) were included.
[Results / Discussion] While SHAP is used to rank features, the manuscript does not report the magnitude or stability of the SHAP values across the different validation schemes (k-fold vs. LOOCV vs. hold-out), leaving open whether the dominance of MEG3 and the cohort-specific attributions are robust or sensitive to the particular train-test split.
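The stability question in the last major comment can be checked with a simple rank comparison. The sketch below, using only the standard library, computes the mean absolute attribution per feature under two validation schemes and compares the resulting rankings with Spearman's rank correlation. The feature names and attribution values are invented for illustration and are not results from the paper; ties are assumed absent.

```python
def mean_abs(attributions):
    """Mean absolute attribution per feature; each row is one subject."""
    n_features = len(attributions[0])
    return [
        sum(abs(row[j]) for row in attributions) / len(attributions)
        for j in range(n_features)
    ]


def ranks(values):
    """Rank 1 is the largest value. Ties are not handled (assumed absent)."""
    order = sorted(range(len(values)), key=lambda j: -values[j])
    result = [0] * len(values)
    for rank, j in enumerate(order, start=1):
        result[j] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for two tie-free score vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


features = ["MEG3_seq", "GAS5_expr", "XIST_expr", "ANRIL_seq"]  # hypothetical

# Made-up per-subject attributions from two validation schemes.
kfold_attr = [[0.5, -0.2, 0.1, 0.1], [-0.3, 0.3, -0.1, 0.2]]
loocv_attr = [[0.4, 0.2, -0.1, 0.3], [-0.36, -0.2, 0.14, 0.14]]

kfold_imp = mean_abs(kfold_attr)
loocv_imp = mean_abs(loocv_attr)
rho = spearman(kfold_imp, loocv_imp)
top_kfold = features[ranks(kfold_imp).index(1)]
top_loocv = features[ranks(loocv_imp).index(1)]
```

In this toy case the top-ranked feature agrees across schemes while the lower ranks swap, which is the kind of detail a bare "MEG3 is dominant" statement would hide.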

minor comments (2)

[Methods] The precise definitions and extraction pipelines for secondary-structure and sequence features should be stated explicitly (e.g., which folding algorithm, which k-mer or motif statistics) so that the feature set can be reproduced.
[Figures / Tables] Figure legends and table captions should include the exact number of samples per cohort and per class to allow immediate evaluation of statistical power.
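As an illustration of the level of detail the first minor comment asks for, the sketch below defines two candidate feature families explicitly: k-mer frequencies for sequence and summary statistics of a dot-bracket string for secondary structure. These are assumed examples of such features, not the definitions used in the paper; the choice of k, the RNA alphabet, and the two structure statistics are placeholders.

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Relative frequency of every k-mer over the alphabet (T is read as U)."""
    seq = seq.upper().replace("T", "U")
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


def structure_summary(dot_bracket):
    """Paired fraction and maximum nesting depth of a dot-bracket structure."""
    depth = max_depth = paired = 0
    for ch in dot_bracket:
        if ch == "(":
            depth += 1
            paired += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
            paired += 1
    return {
        "paired_fraction": paired / len(dot_bracket),
        "max_depth": max_depth,
    }


freqs = kmer_frequencies("ACGTACGT", k=2)   # toy sequence, not a real lncRNA
summary = structure_summary("((..))..")     # toy structure string
```

Stating k, the alphabet handling, and the structure statistics at this level is what allows a feature matrix to be regenerated from the raw sequences.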

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We have carefully addressed each major comment below with point-by-point responses. Revisions have been made to enhance the clarity, completeness, and robustness of the reported findings.

Referee: [Abstract / Results] The manuscript states that ML results are 'consistent with established statistical methods' and reports specific feature associations, yet supplies no cohort sizes, model performance metrics (accuracy, AUC, F1), confidence intervals, or p-value thresholds. Without these quantities the strength of the claimed associations and the reliability of the SHAP rankings cannot be assessed.

Authors: We agree that these quantitative details are necessary to properly evaluate the strength and reliability of the associations. In the revised manuscript we have added cohort sizes for both RNA-seq cohorts, full performance metrics (accuracy, AUC, F1) with 95% confidence intervals for all eight classifiers under each validation scheme, and the p-value thresholds used for the statistical comparisons. These additions appear in the Results section and have been incorporated into the abstract.

Revision: yes

Referee: [Methods] The input space is restricted a priori to ten literature-reported lncRNAs and three hand-crafted feature families. No genome-wide screen or sensitivity analysis is described; consequently the reported population- and subject-specific profiles are conditional on this narrow prior selection and could change if additional lncRNAs or unmeasured covariates (batch, cell-type composition) were included.

Authors: The ten lncRNAs were deliberately chosen on the basis of existing literature to enable focused multi-modal analysis of established candidates. We acknowledge that this prior selection conditions the reported profiles and that unmeasured covariates could influence results. In the revision we have expanded the Methods and Discussion sections to explicitly state this limitation, added a brief sensitivity analysis by re-running models after removing one feature family at a time, and noted that a genome-wide screen lies beyond the scope of the present study.

Revision: partial

Referee: [Results / Discussion] While SHAP is used to rank features, the manuscript does not report the magnitude or stability of the SHAP values across the different validation schemes (k-fold vs. LOOCV vs. hold-out), leaving open whether the dominance of MEG3 and the cohort-specific attributions are robust or sensitive to the particular train-test split.

Authors: We have now quantified SHAP value magnitudes and assessed their stability across the three validation schemes. Mean absolute SHAP values with standard deviations across folds are reported in the revised Results section; MEG3 remains the highest-ranked feature in all schemes, and the cohort-specific feature attributions show consistent patterns. A supplementary figure summarizing SHAP stability has been added.

Revision: yes
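The sensitivity analysis described in the second response, re-running models after removing one feature family at a time, can be sketched as follows. This is an assumed illustration rather than the authors' procedure: it uses a nearest-centroid classifier in place of the paper's eight classifiers, leave-one-out accuracy as the only metric, and a six-subject toy matrix in which only the expression column carries signal.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y):
    """Leave-one-out accuracy of the nearest-centroid classifier."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, X[i]) == y[i]
    return hits / len(X)


def ablate(X, y, families):
    """Accuracy with all features ("none" removed) and with each family removed."""
    results = {"none": loocv_accuracy(X, y)}
    for name, cols in families.items():
        keep = [j for j in range(len(X[0])) if j not in cols]
        reduced = [[row[j] for j in keep] for row in X]
        results[name] = loocv_accuracy(reduced, y)
    return results


# Toy data (hypothetical): columns are expression, structure, sequence.
X = [
    [1.0, 0.5, 0.2], [1.2, 0.4, 0.8], [0.9, 0.6, 0.5],   # controls
    [3.0, 0.5, 0.3], [3.1, 0.6, 0.7], [2.8, 0.4, 0.5],   # T2D
]
y = [0, 0, 0, 1, 1, 1]
families = {"expression": [0], "structure": [1], "sequence": [2]}
results = ablate(X, y, families)
```

A family whose removal leaves accuracy unchanged is not necessarily uninformative; it may simply be redundant with another family, so ablation results are best read alongside the attribution analysis.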

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper pre-selects ten lncRNAs from external literature reports, extracts expression/secondary-structure/sequence features from two independent cohorts, trains eight ML classifiers under cross-validation, and applies SHAP for interpretation. The reported associations (e.g., GAS5/XIST expression and MEG3 dominance) are outputs of model training and post-hoc attribution on held-out data, cross-checked against separate statistical tests. No step reduces a claimed prediction to a fitted input by construction, invokes self-citation as the sole justification for a uniqueness claim, or renames a known result; the pipeline remains data-driven and externally benchmarked.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the relevance of ten pre-selected lncRNAs drawn from prior literature and on standard assumptions that cross-validation schemes yield reliable performance estimates for the chosen feature set.

axioms (2)

domain assumption · The ten literature-reported lncRNAs are relevant starting points for T2D association analysis. The paper begins from established associations without independent selection or validation of the lncRNA list.
standard math · Stratified k-fold, LOOCV, and repeated hold-out cross-validation produce robust performance estimates. Invoked to justify model evaluation without discussion of potential violations in small or imbalanced cohorts.
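The caveat about small or imbalanced cohorts can be made concrete with a toy calculation. The sketch below compares accuracy with the Matthews correlation coefficient (MCC) for a classifier that always predicts the majority class, and attaches a percentile bootstrap interval to the accuracy. The cohort of 4 cases and 16 controls, the number of resamples, and the percentile method are assumptions for illustration, not figures or methods from the paper.

```python
import random
from math import sqrt


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric, resampling subjects."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy imbalanced cohort (hypothetical): 4 T2D cases, 16 controls.
y_true = [1] * 4 + [0] * 16
y_pred = [0] * 20            # a model that always predicts "control"

acc = accuracy(y_true, y_pred)
score = mcc(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, accuracy)
```

The majority-class model reaches 80% accuracy while its MCC is zero, and with only 20 subjects the bootstrap interval around the accuracy is wide. Both effects are reasons to report class counts and interval estimates alongside any headline metric.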

pith-pipeline@v0.9.0 · 5819 in / 1370 out tokens · 42483 ms · 2026-05-21T02:26:40.444240+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We investigated ten literature reported lncRNAs... extracting expression, secondary-structure, and sequence features... Eight machine learning (ML) classifiers were evaluated under stratified k-fold, leave-one-out cross-validation (LOOCV), and repeated hold-out schemes... SHAP analysis was applied for subject-level association interpretation.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: In one cohort, GAS5 and XIST expression features, along with GAS5, MEG3, and ANRIL sequence features, were found to be associated with T2D...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Pathophysiology of type 2 diabetes mellitus,

U. Galicia-Garcia, A. Benito-Vicente, S. Jebari, A. Larrea-Sebal, H. Sid- diqi, K. B. Uribe, H. Ostolaza, and C. Mart ´ın, “Pathophysiology of type 2 diabetes mellitus,”Int. J. Mol. Sci., vol. 21, no. 17, p. 6275, 2020

work page 2020

[2] [2]

IDF Diabetes Atlas: Global, regional and country-level diabetes prevalence estimates for 2021 and projections for 2045,

H. Sun, P. Saeedi, S. Karuranga, M. Pinkepank, K. Ogurtsova, B. B. Duncan, C. Stein, A. Basit, J. C. N. Chan, J. C. Mbanya, M. E. Pavkov, A. Ramachandaran, S. H. Wild, S. James, W. H. Herman, P. Zhang, C. Bommer, S. Kuo, E. J. Boyko, and D. J. Magliano, “IDF Diabetes Atlas: Global, regional and country-level diabetes prevalence estimates for 2021 and proj...

work page 2021

[3] [3]

β- Cell failure in type 2 diabetes: postulated mechanisms and prospects for prevention and treatment,

P. A. Halban, K. S. Polonsky, D. W. Bowden, M. A. Hawkins, C. Ling, K. J. Mather, A. C. Powers, C. J. Rhodes, L. Sussel, and G. C. Weir, “β- Cell failure in type 2 diabetes: postulated mechanisms and prospects for prevention and treatment,”Diabetes Care, vol. 37, no. 6, pp. 1751–1758, 2014

work page 2014

[4] [4]

The role of long non-coding RNAs in the regulation of pancreatic beta cell identity,

M. E. Wilson and T. J. Pullen, “The role of long non-coding RNAs in the regulation of pancreatic beta cell identity,”Biochem. Soc. Trans., vol. 49, no. 5, pp. 2153–2161, 2021

work page 2021

[5] [5]

Gene regulation by long non-coding RNAs and its biological functions,

L. Statello, C.-J. Guo, L.-L. Chen, and M. Huarte, “Gene regulation by long non-coding RNAs and its biological functions,”Nature Rev. Mol. Cell Biol., vol. 22, no. 2, pp. 96–118, 2021

work page 2021

[6] J. L. Rinn and H. Y. Chang, “Genome regulation by long noncoding RNAs,” Annu. Rev. Biochem., vol. 81, pp. 145–166, 2012.

[7] T. R. Mercer and J. S. Mattick, “Structure and function of long noncoding RNAs in epigenetic regulation,” Nature Struct. Mol. Biol., vol. 20, no. 3, pp. 300–307, 2013.

[8] A. Pandey, S. Ajgaonkar, N. Jadhav, P. Saha, P. Gurav, S. Panda, D. Mehta, and S. Nair, “Current insights into miRNA and lncRNA dysregulation in diabetes: signal transduction, clinical trials and biomarker discovery,” Pharmaceuticals, vol. 15, no. 10, p. 1269, 2022.

[9] C. Sathishkumar, P. Prabu, V. Mohan, and M. Balasubramanyam, “Linking a role of lncRNAs (long non-coding RNAs) with insulin resistance, accelerated senescence, and inflammation in patients with type 2 diabetes,” Hum. Genomics, vol. 12, no. 1, p. 41, 2018.

[10] M. S. Akella, A. Mendonca, T. Manikandan, D. Sateesh, A. R. Swaminathan, D. Parameshwaran, M. Gupta, and S. Sundaresan, “A compendium of noncoding RNAs as biomarkers in type 2 diabetes mellitus,” J. Pharm. Biomed. Anal. Open, vol. 5, p. 100057, 2025.

[11] X. He, C. Ou, Y. Xiao, Q. Han, H. Li, and S. Zhou, “LncRNAs: key players and novel insights into diabetes mellitus,” Oncotarget, vol. 8, no. 41, pp. 71325–71341, 2017.

[12] M. K. Iyer, Y. S. Niknafs, R. Malik, U. Singhal, A. Sahu, Y. Hosono, T. R. Barrette, J. R. Prensner, J. R. Evans, S. Zhao, A. Poliakov, X. Cao, S. M. Dhanasekaran, Y.-M. Wu, D. R. Robinson, D. G. Beer, F. Y. Feng, H. K. Iyer, and A. M. Chinnaiyan, “The landscape of long noncoding RNAs in the human transcriptome,” Nature Genet., vol. 47, no. 3, pp. 199–...

[13] Y. Hasin, M. Seldin, and A. Lusis, “Multi-omics approaches to disease,” Genome Biol., vol. 18, no. 1, p. 83, 2017.

[14] K. J. Karczewski and M. P. Snyder, “Integrative omics for health and disease,” Nature Rev. Genet., vol. 19, no. 5, pp. 299–310, 2018.

[15] N. Cliff, “Dominance statistics: ordinal analyses to answer ordinal questions,” Psychol. Bull., vol. 114, no. 3, pp. 494–509, 1993.

[16] M. I. Love, W. Huber, and S. Anders, “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2,” Genome Biol., vol. 15, no. 12, p. 550, 2014.

[17] A. V. Ponce-Bobadilla, V. Schmitt, C. S. Maier, S. Mensing, and S. Stodtmann, “Practical guide to SHAP analysis: explaining supervised machine learning model predictions in drug development,” Clin. Transl. Sci., vol. 17, no. 11, p. e70056, 2024.

[18] L. Marselli, A. Piron, M. Suleiman, M. L. Colli, X. Yi, A. Khamis, G. R. Carrat, G. A. Rutter, M. Bugliani, L. Giusti, M. Ronci, M. Ibberson, J.-V. Turatsinze, U. Boggi, P. D. Simone, V. D. Tata, M. Lopes, D. Nasteska, C. D. Luca, M. Tesi, E. Bosi, P. Singh, D. Campani, A. M. Schulte, M. Solimena, P. Hecht, B. Rady, I. Bakaj, A. Pocai, L. Norquay, B. Th..., “Persistent or transient human β cell dysfunction induced by metabolic stress: specific signatures and shared gene expression with type 2 diabetes,” ...

[19] L. Wigger, M. Barovic, A.-D. Brunner, F. Marzetta, E. Schöniger, F. Mehl, N. Kipke, D. Friedland, F. Burdet, C. Kessler, M. Lesche, B. Thorens, E. Bonifacio, C. Legido-Quigley, P. B. S. Hilaire, P. Delerive, A. Dahl, C. Klose, M. J. Gerl, K. Simons, D. Aust, J. Weitz, M. Distler, A. M. Schulte, M. Mann, M. Ibberson, and M. Solimena, “Multi-omics profiling of living human pancreatic islet donors reveals heterogeneous beta cell trajectories towards type 2 diabetes,” ...

[20] S. Andrews, “FastQC: a quality control tool for high throughput sequence data,” Babraham Bioinformatics, 2010. [Online]. Available: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

[21] F. Krueger, “Trim Galore!” Babraham Bioinformatics, 2012. [Online]. Available: http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/

[22] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, and M. A. DePristo, “The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data,” Genome Res., vol. 20, no. 9, pp. 1297–1303, 2010.

[23] R. Muhammod, S. Ahmed, D. M. Farid, S. Shatabda, A. Sharma, and A. Dehzangi, “PyFeat: a Python-based effective feature generation tool for DNA, RNA and protein sequences,” Bioinformatics, vol. 35, no. 19, pp. 3831–3833, 2019.

[24] R. P. Bonidia, L. D. H. Sampaio, D. S. Domingues, A. R. Paschoal, F. M. Lopes, A. C. P. L. F. D. Carvalho, and D. S. Sanches, “Feature extraction approaches for biological sequences: a comparative study of mathematical features,” Briefings Bioinform., vol. 22, no. 5, p. bbab011, 2021.

[25] W. Chen, H. Lin, and K.-C. Chou, “Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences,” Mol. BioSyst., vol. 11, no. 10, pp. 2620–2634, 2015.

[26] R. Zhang and C. T. Zhang, “Z curves, an intuitive tool for visualizing and analyzing the DNA sequences,” J. Biomol. Struct. Dyn., vol. 11, no. 4, pp. 767–782, 1994.

[27] L. Huang, H. Zhang, D. Deng, K. Zhao, K. Liu, D. A. Hendrix, and D. H. Mathews, “LinearFold: linear-time approximate RNA folding by 5’-to-3’ dynamic programming and beam search,” arXiv:2001.04020, 2020.

[28] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Mach. Learn., vol. 63, no. 1, pp. 3–42, 2006.

[29] E. Taghizadeh, S. Heydarheydari, A. Saberi, S. JafarpoorNesheli, and S. M. Rezaeijo, “Breast cancer prediction with transcriptome profiling using feature selection and machine learning methods,” BMC Bioinformatics, vol. 23, no. 1, p. 410, 2022.

[30] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.

[31] D. W. Hosmer and S. Lemeshow, “Applied logistic regression,” Wiley, 2000.

[32] Z. Zhang, “Introduction to machine learning: k-nearest neighbors,” Ann. Transl. Med., vol. 4, no. 11, p. 218, 2016.

[33] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.

[34] A. Tharwat, T. Gaber, A. Ibrahim, and A. E. Hassanien, “Linear discriminant analysis: a detailed tutorial,” AI Commun., vol. 30, no. 2, pp. 169–190, 2017.

[35] J. R. Quinlan, “Induction of decision trees,” Mach. Learn., vol. 1, no. 1, pp. 81–106, 1986.

[36] O. Peretz, M. Koren, and O. Koren, “Naive Bayes classifier – an ensemble procedure for recall and precision enrichment,” Eng. Appl. Artif. Intell., vol. 136, p. 108972, 2024.

[37] T. Fushiki, “Estimation of prediction error by using K-fold cross-validation,” Stat. Comput., vol. 21, no. 2, pp. 137–146, 2011.

[38] H. Cheng, D. J. Garrick, and R. L. Fernando, “Efficient strategies for leave-one-out cross validation for genomic best linear unbiased prediction,” J. Animal Sci. Biotechnol., vol. 8, no. 1, p. 38, 2017.

[39] S. Raschka, “Model evaluation, model selection, and algorithm selection in machine learning,” arXiv:1811.12808, 2018.

[40] D. Chicco and G. Jurman, “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation,” BMC Genomics, vol. 21, no. 1, p. 6, 2020.

[41] K. M. Sujon, R. Hassan, K. Choi, and M. A. Samad, “Accuracy, precision, recall, F1-score, or MCC? empirical evidence from advanced statistics, ML, and XAI for evaluating business predictive models,” J. Big Data, vol. 12, no. 1, p. 268, 2025.

[42] J. Romano, J. D. Kromrey, J. Coraggio, and J. Skowronek, “Appropriate statistics for ordinal level data: should we really be using t-test and Cohen’s d for evaluating group differences on the NSSE and other surveys?” Annu. Meeting Florida Assoc. Inst. Res., 2006.

[43] A. T. Villikudathil, “Exploring metformin monotherapy response in type-2 diabetes: computational insights through clinical, genomic, and proteomic markers using machine learning algorithms,” Comput. Biol. Med., 2024.

[44] P. Nuthakki and T. P. Kumar, “Machine learning-based early detection of diabetes risk factors for improved health management,” Multimedia Tools Appl., vol. 83, no. 42, pp. 89665–89680, 2024.

[45] M. Matboli, A. Khaled, M. F. Ahmed, M. Y. Ahmed, R. Khaled, G. M. Elmakromy, A. M. A. Ghani, M. M. El-Shafei, M. R. M. Abdelhalim, and A. M. A. Gwad, “Machine learning-based stratification of prediabetes and type 2 diabetes progression,” Diabetol. Metab. Syndr., vol. 17, no. 1, p. 227, 2025.