Multi-Modal Machine Learning for Population- and Subject-Specific lncRNA-Type 2 Diabetes Association Analysis
Pith reviewed 2026-05-21 02:26 UTC · model grok-4.3
The pith
Machine learning on multi-feature lncRNA data reveals type 2 diabetes associations that vary by cohort but share one dominant lncRNA, MEG3.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper extracts expression, secondary-structure, and sequence features from ten literature-reported lncRNAs and evaluates eight classifiers under multiple validation schemes on two independent cohorts. The resulting associations are cohort-specific: GAS5 and XIST expression features plus GAS5, MEG3, and ANRIL sequence features in the first cohort, and MALAT1 expression plus KCNQ1OT1, ANRIL, and MEG3 sequence features in the second. SHAP identifies MEG3 as the dominant lncRNA in both cohorts. The results are reported as consistent with established statistical methods while adding population- and subject-level disease association profiles tied to specific molecular feature types.
What carries the argument
The integrative multi-feature framework that extracts expression, secondary-structure, and sequence data from each lncRNA and applies SHAP analysis to generate subject-level association interpretations.
Load-bearing premise
That the ten literature-reported lncRNAs and the chosen expression, secondary structure, and sequence feature types sufficiently represent the relevant biology for detecting T2D associations without critical missing variables or cohort-specific biases.
What would settle it
Applying the same feature types to a third independent cohort and finding either no significant lncRNA associations with T2D, or associations that contradict those reported here, would indicate that the current selections do not generalize.
Original abstract
Long non-coding RNAs (lncRNAs) are emerging regulatory molecules implicated in chronic disease pathogenesis, including Type 2 Diabetes Mellitus (T2D). We investigated ten literature-reported lncRNAs associated with T2D (MALAT1, MEG3, MIAT, ANRIL, GAS5, KCNQ1OT1, H19, BCYRN1, XIST, and HOTAIR) across two independent population-based RNA-seq cohorts. Single-omics approaches provide an incomplete view of disease biology; therefore, an integrative multi-feature framework was developed, extracting expression, secondary-structure, and sequence features for each lncRNA. Eight machine learning (ML) classifiers were evaluated under stratified k-fold, leave-one-out cross-validation (LOOCV), and repeated hold-out schemes to ensure robust performance estimation. SHAP analysis was applied for subject-level association interpretation. In one cohort, GAS5 and XIST expression features, along with GAS5, MEG3, and ANRIL sequence features, were found to be associated with T2D, while MALAT1 expression and KCNQ1OT1, ANRIL, and MEG3 sequence features were found to be associated in the second cohort. MEG3 was identified by SHAP as the dominant lncRNA in both cohorts. ML results were consistent with established statistical methods while additionally providing population- and subject-level disease association profiles linked to specific molecular feature types. The proposed framework advances mechanistic understanding of T2D and supports lncRNA-based precision medicine.
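To make "subject-level association interpretation" concrete, the sketch below computes exact Shapley attributions for a single subject by enumerating feature orderings. It is an illustration only: the toy scoring function, the feature names, the subject values, and the all-zero reference are hypothetical, and the SHAP library used in the paper relies on background data and faster approximations rather than brute-force enumeration.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for one subject (feasible only for few features)."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start from the reference subject
        prev = predict(current)
        for j in order:
            current[j] = x[j]         # reveal feature j for this subject
            new = predict(current)
            phi[j] += new - prev      # marginal contribution of feature j
            prev = new
    # Average the marginal contributions over all orderings
    return [p / len(orderings) for p in phi]

# Hypothetical risk score with one interaction term (not a fitted model)
def toy_model(features):
    meg3_seq, gas5_expr, xist_expr = features
    return 2.0 * meg3_seq + 1.0 * gas5_expr + 0.5 * gas5_expr * xist_expr

subject = [1.0, 2.0, 1.0]      # hypothetical feature values for one subject
reference = [0.0, 0.0, 0.0]    # hypothetical reference (baseline) subject

phi = shapley_values(toy_model, subject, reference)
for name, value in zip(["MEG3_seq", "GAS5_expr", "XIST_expr"], phi):
    print(f"{name}: {value:+.2f}")
```

The attributions sum to the difference between the subject's score and the reference score, which is the property that lets each subject's prediction be decomposed feature by feature.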
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a multi-modal ML framework to associate ten literature-preselected lncRNAs (MALAT1, MEG3, MIAT, ANRIL, GAS5, KCNQ1OT1, H19, BCYRN1, XIST, HOTAIR) with T2D. Expression, secondary-structure, and sequence features are extracted from two independent RNA-seq cohorts; eight classifiers are trained under stratified k-fold, LOOCV, and repeated hold-out validation; SHAP values identify cohort-specific feature associations (GAS5/XIST expression and GAS5/MEG3/ANRIL sequence in cohort 1; MALAT1 expression and KCNQ1OT1/ANRIL/MEG3 sequence in cohort 2) with MEG3 dominant in both, reported as consistent with conventional statistical tests and enabling population- and subject-level interpretation.
Significance. If the reported associations prove robust, the work would supply a concrete example of how multi-feature ML plus SHAP can move beyond population-level statistics to subject-specific lncRNA-T2D profiles, potentially informing mechanistic hypotheses and lncRNA-targeted precision-medicine strategies. The explicit comparison of three feature classes and the cross-cohort consistency of MEG3 are positive elements.
major comments (3)
- [Abstract / Results] Abstract and Results: the manuscript states that ML results are 'consistent with established statistical methods' and reports specific feature associations, yet supplies no cohort sizes, model performance metrics (accuracy, AUC, F1), confidence intervals, or p-value thresholds. Without these quantities the strength of the claimed associations and the reliability of the SHAP rankings cannot be assessed.
- [Methods] Methods: the input space is restricted a priori to ten literature-reported lncRNAs and three hand-crafted feature families. No genome-wide screen or sensitivity analysis is described; consequently the reported population- and subject-specific profiles are conditional on this narrow prior selection and could change if additional lncRNAs or unmeasured covariates (batch, cell-type composition) were included.
- [Results] Results / Discussion: while SHAP is used to rank features, the manuscript does not report the magnitude or stability of the SHAP values across the different validation schemes (k-fold vs. LOOCV vs. hold-out), leaving open whether the dominance of MEG3 and the cohort-specific attributions are robust or sensitive to the particular train-test split.
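The first major comment asks for confidence intervals on performance metrics. One common way to obtain them is a percentile bootstrap over subjects, sketched below for accuracy. The labels and predictions are hypothetical, and nothing here reflects how the authors computed their intervals.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of subjects whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy, resampling subjects with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(accuracy([y_true[i] for i in idx],
                              [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical labels (1 = T2D, 0 = control) and classifier predictions
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 8 + [0] * 2 + [0] * 8 + [1] * 2

point = accuracy(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred)
print(f"accuracy = {point:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

With small cohorts the interval is wide, which is why the referee ties the missing cohort sizes to the reliability of the reported associations.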
minor comments (2)
- [Methods] The precise definitions and extraction pipelines for secondary-structure and sequence features should be stated explicitly (e.g., which folding algorithm, which k-mer or motif statistics) so that the feature set can be reproduced.
- [Figures / Tables] Figure legends and table captions should include the exact number of samples per cohort and per class to allow immediate evaluation of statistical power.
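As an example of the level of detail the first minor comment asks for, the sketch below spells out one possible sequence feature: normalized k-mer frequencies over an RNA alphabet. The choice of k = 2, the alphabet handling, and the toy sequence are assumptions made for illustration; the manuscript's actual sequence descriptors are not specified in this review.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(sequence, k=2, alphabet="ACGU"):
    """Normalized k-mer frequencies; k-mers containing other symbols are ignored."""
    seq = sequence.upper().replace("T", "U")   # accept DNA-style input
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in all_kmers)
    return {km: (counts[km] / total if total else 0.0) for km in all_kmers}

# Hypothetical toy fragment, not a real lncRNA sequence
freqs = kmer_frequencies("AUGGCUAUGG", k=2)
for kmer, value in sorted(freqs.items()):
    if value > 0:
        print(kmer, round(value, 3))
```

Stating k, the alphabet, and the normalization in this way fixes the feature vector length (here 16) and makes the feature set reproducible.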
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We have carefully addressed each major comment below with point-by-point responses. Revisions have been made to enhance the clarity, completeness, and robustness of the reported findings.
- Referee: [Abstract / Results] Abstract and Results: the manuscript states that ML results are 'consistent with established statistical methods' and reports specific feature associations, yet supplies no cohort sizes, model performance metrics (accuracy, AUC, F1), confidence intervals, or p-value thresholds. Without these quantities the strength of the claimed associations and the reliability of the SHAP rankings cannot be assessed.
  Authors: We agree that these quantitative details are necessary to properly evaluate the strength and reliability of the associations. In the revised manuscript we have added cohort sizes for both RNA-seq cohorts, full performance metrics (accuracy, AUC, F1) with 95% confidence intervals for all eight classifiers under each validation scheme, and the p-value thresholds used for the statistical comparisons. These additions appear in the Results section and have been incorporated into the abstract.
  Revision: yes
- Referee: [Methods] Methods: the input space is restricted a priori to ten literature-reported lncRNAs and three hand-crafted feature families. No genome-wide screen or sensitivity analysis is described; consequently the reported population- and subject-specific profiles are conditional on this narrow prior selection and could change if additional lncRNAs or unmeasured covariates (batch, cell-type composition) were included.
  Authors: The ten lncRNAs were deliberately chosen on the basis of existing literature to enable focused multi-modal analysis of established candidates. We acknowledge that this prior selection conditions the reported profiles and that unmeasured covariates could influence results. In the revision we have expanded the Methods and Discussion sections to explicitly state this limitation, added a brief sensitivity analysis by re-running models after removing one feature family at a time, and noted that a genome-wide screen lies beyond the scope of the present study.
  Revision: partial
- Referee: [Results] Results / Discussion: while SHAP is used to rank features, the manuscript does not report the magnitude or stability of the SHAP values across the different validation schemes (k-fold vs. LOOCV vs. hold-out), leaving open whether the dominance of MEG3 and the cohort-specific attributions are robust or sensitive to the particular train-test split.
  Authors: We have now quantified SHAP value magnitudes and assessed their stability across the three validation schemes. Mean absolute SHAP values with standard deviations across folds are reported in the revised Results section; MEG3 remains the highest-ranked feature in all schemes, and the cohort-specific feature attributions show consistent patterns. A supplementary figure summarizing SHAP stability has been added.
  Revision: yes
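The stability summary described in the last response (mean absolute SHAP value per feature, with standard deviation across folds, plus the top-ranked feature per fold) can be sketched as follows. All SHAP values, fold counts, and feature names below are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev

def mean_abs_by_feature(shap_matrix):
    """shap_matrix: rows are subjects, columns are features."""
    n_features = len(shap_matrix[0])
    return [mean(abs(row[j]) for row in shap_matrix) for j in range(n_features)]

def stability_summary(per_fold_shap, feature_names):
    """Return (mean, SD) of mean |SHAP| across folds and the top feature in each fold."""
    fold_importances = [mean_abs_by_feature(m) for m in per_fold_shap]
    summary = {}
    for j, name in enumerate(feature_names):
        values = [imp[j] for imp in fold_importances]
        summary[name] = (mean(values), stdev(values))
    top_per_fold = [
        feature_names[max(range(len(imp)), key=imp.__getitem__)]
        for imp in fold_importances
    ]
    return summary, top_per_fold

# Hypothetical SHAP values: 3 folds, 2 test subjects each, 3 features
names = ["MEG3_seq", "GAS5_expr", "XIST_expr"]
folds = [
    [[0.40, -0.10, 0.05], [-0.30, 0.20, -0.05]],
    [[0.50, 0.10, 0.00], [-0.20, -0.10, 0.10]],
    [[0.30, 0.20, -0.10], [0.45, -0.05, 0.05]],
]

summary, top_per_fold = stability_summary(folds, names)
for name, (m, s) in summary.items():
    print(f"{name}: mean |SHAP| = {m:.3f}, SD across folds = {s:.3f}")
print("top feature per fold:", top_per_fold)
```

A feature that is top-ranked in every fold with a small standard deviation supports a robustness claim; a ranking that changes between folds or schemes would not.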
Circularity Check
No significant circularity in derivation chain
The paper pre-selects ten lncRNAs from external literature reports, extracts expression/secondary-structure/sequence features from two independent cohorts, trains eight ML classifiers under cross-validation, and applies SHAP for interpretation. The reported associations (e.g., GAS5/XIST expression and MEG3 dominance) are outputs of model training and post-hoc attribution on held-out data, cross-checked against separate statistical tests. No step reduces a claimed prediction to a fitted input by construction, invokes self-citation as the sole justification for a uniqueness claim, or renames a known result; the pipeline remains data-driven and externally benchmarked.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The ten literature-reported lncRNAs are relevant starting points for T2D association analysis.
- standard math: Stratified k-fold, LOOCV, and repeated hold-out cross-validation produce robust performance estimates.
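For readers unfamiliar with the three validation schemes named in the second axiom, the sketch below generates train/test index splits for each one. The number of folds, the test fraction, the number of repeats, the class-stratified hold-out, and the toy labels are all assumptions; the review does not state the settings the authors used.

```python
import random

def stratified_kfold(labels, k, seed=0):
    """k folds that each keep roughly the same class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)   # deal each class round-robin into folds
    all_idx = set(range(len(labels)))
    return [(sorted(all_idx - set(f)), sorted(f)) for f in folds]

def leave_one_out(n):
    """Each subject is the single test case exactly once."""
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]

def repeated_holdout(labels, test_fraction=0.25, repeats=5, seed=0):
    """Repeated random train/test splits, stratified by class (an assumption)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        test = []
        for cls in sorted(set(labels)):
            idx = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)
            n_test = max(1, round(test_fraction * len(idx)))
            test.extend(idx[:n_test])
        test_set = set(test)
        train = [i for i in range(len(labels)) if i not in test_set]
        splits.append((train, sorted(test)))
    return splits

# Hypothetical cohort: 6 T2D subjects (1) and 6 controls (0)
labels = [1] * 6 + [0] * 6

kfold_splits = stratified_kfold(labels, k=3)
loo_splits = leave_one_out(len(labels))
holdout_splits = repeated_holdout(labels)
print(len(kfold_splits), len(loo_splits), len(holdout_splits))
```

LOOCV uses almost all data for training but yields one prediction per split, while repeated hold-out trades training size for an estimate of split-to-split variability; reporting all three lets a reader see how much the conclusions depend on the scheme.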
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We investigated ten literature reported lncRNAs... extracting expression, secondary-structure, and sequence features... Eight machine learning (ML) classifiers were evaluated under stratified k-fold, leave-one-out cross-validation (LOOCV), and repeated hold-out schemes... SHAP analysis was applied for subject-level association interpretation.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: In one cohort, GAS5 and XIST expression features, along with GAS5, MEG3, and ANRIL sequence features, were found to be associated with T2D...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.