Machine-Learning-Enhanced Non-Invasive Testing for MASLD Fibrosis: Shallow-Deep Neural Networks Versus FIB-4, Tabular Foundation Models, and Large Language Models
Pith reviewed 2026-05-21 06:57 UTC · model grok-4.3
The pith
A compact neural network improves advanced fibrosis detection over FIB-4 in MASLD using only the FIB-4 score and the four routine variables it is computed from.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A shallow-deep neural network with 354 trainable parameters that takes age, FIB-4, aspartate aminotransferase, alanine aminotransferase, and platelet count as inputs achieves external ROC-AUCs of 0.77 in Malaysia and 0.67 in India, compared with FIB-4 values of 0.75 and 0.60 on the same cohorts. The model shows balanced calibration with Brier scores of 0.18 and 0.22 and identifies AST and FIB-4 as the dominant variables by permutation importance.
What carries the argument
The shallow-deep neural network (s-DNN), a compact non-linear model with 354 trainable parameters that learns flexible combinations of its five inputs (age, AST, ALT, platelet count, and the FIB-4 score itself) to output a probability of advanced fibrosis.
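The parameter count can be sanity-checked with a short sketch. It assumes the s-DNN is a plain fully connected network with bias terms and a single output unit, which this page does not state explicitly; the hidden widths (17, 5, 23) come from the paper passage quoted further down this page, and the helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected feed-forward network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total


# Assumed layout: 5 inputs -> hidden widths (17, 5, 23) -> 1 output probability.
S_DNN_LAYOUT = [5, 17, 5, 23, 1]

print(dense_param_count(S_DNN_LAYOUT))  # 354
```

Under these assumptions the layers contribute 102, 90, 138, and 24 parameters respectively, which is consistent with the 354 reported.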
If this is right
- Routine blood-test panels already contain enough information for modestly better detection of advanced fibrosis if combined non-linearly.
- Very small models can match or exceed larger foundation models for this narrow clinical task while remaining easy to deploy.
- AST and the FIB-4 score itself carry most of the predictive signal, so data collection can stay focused on existing labs.
- External validation on two separate cohorts provides evidence that the gain is not limited to the training population.
- Similar compact ML replacements could be tested for other fixed non-invasive scores in liver disease.
Where Pith is reading between the lines
- Integration of the s-DNN into electronic health record systems could enable automatic, real-time fibrosis risk alerts during standard visits.
- Further testing on cohorts that include more Western patients or varied comorbidities would clarify whether the performance edge persists across populations.
- Pairing the model with transient elastography or other imaging NITs might produce combined scores that further reduce the need for biopsy.
- The finding that a 354-parameter network matches or modestly exceeds much larger models on external ROC-AUC suggests that task-specific simplicity can be preferable to general-purpose foundation models in narrow medical applications.
Load-bearing premise
The Malaysian and Indian external cohorts are representative of the broader MASLD population and free of unmeasured selection or label biases that would change the observed performance differences.
What would settle it
A prospective study on an independent biopsy-confirmed MASLD cohort from a different geographic or demographic setting that finds the s-DNN ROC-AUC no higher than FIB-4 would falsify the generalization of the reported improvement.
Original abstract
Advanced fibrosis is a major determinant of liver-related morbidity in metabolic dysfunction-associated steatotic liver disease (MASLD). FIB-4 is widely used as a first-line non-invasive test, but its fixed formula may underuse diagnostic information contained in age, aspartate aminotransferase, alanine aminotransferase, and platelet count. We evaluated whether machine-learning-enhanced non-invasive testing (MLE-NIT) can improve advanced fibrosis detection while preserving this FIB-4 variable space. We used three biopsy-confirmed MASLD cohorts from China, Malaysia, and India (n=784). The Chinese cohort was split into 486 training and 54 internal validation/tuning patients; final performance was reported only on the Malaysian and Indian external cohorts. Models used five variables: age, FIB-4, aspartate aminotransferase, platelet count, and alanine aminotransferase. We compared FIB-4 with a shallow-deep neural network (s-DNN), TabPFN, and gpt-4o-2024-08-06. FIB-4 achieved external ROC-AUCs of 0.75 and 0.60 in Malaysia and India, respectively. TabPFN achieved 0.69 and 0.66, fine-tuned GPT-4o achieved 0.75 and 0.63, and the s-DNN achieved 0.77 and 0.67, respectively. The s-DNN contained only 354 trainable parameters, compared with 7,244,554 for TabPFN, yet provided a more balanced external operating profile. Calibration showed s-DNN Brier scores of 0.18 and 0.22, and permutation importance identified AST and FIB-4 as dominant variables. Compact non-linear MLE-NITs may enhance FIB-4-based fibrosis assessment without increasing clinical data requirements.
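For readers unfamiliar with the "fixed formula" the abstract refers to, the sketch below computes FIB-4 in its standard form, age × AST / (platelets × √ALT), and assembles the five-element input vector in the order the abstract lists. The function names and the example patient values are hypothetical, and the paper's actual preprocessing (scaling, units handling) is not described on this page.

```python
import math


def fib4(age_years, ast_u_per_l, alt_u_per_l, platelets_1e9_per_l):
    """Standard FIB-4 index: age * AST / (platelets * sqrt(ALT))."""
    if alt_u_per_l <= 0 or platelets_1e9_per_l <= 0:
        raise ValueError("ALT and platelet count must be positive")
    return (age_years * ast_u_per_l) / (platelets_1e9_per_l * math.sqrt(alt_u_per_l))


def model_inputs(age_years, ast_u_per_l, alt_u_per_l, platelets_1e9_per_l):
    """Five features in the abstract's order: age, FIB-4, AST, platelets, ALT."""
    score = fib4(age_years, ast_u_per_l, alt_u_per_l, platelets_1e9_per_l)
    return [age_years, score, ast_u_per_l, platelets_1e9_per_l, alt_u_per_l]


# Hypothetical patient: 50 years, AST 40 U/L, ALT 36 U/L, platelets 200 x 10^9/L.
print(round(fib4(50, 40, 36, 200), 2))  # 1.67
print(model_inputs(50, 40, 36, 200))
```

Because FIB-4 is a deterministic function of the other four inputs, the fifth feature adds no new measurement; it only hands the network a pre-computed non-linear combination.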
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates whether a compact shallow-deep neural network (s-DNN) can improve detection of advanced fibrosis in MASLD over the standard FIB-4 index by using the same clinical variables (age, AST, ALT, platelets) plus FIB-4 itself. The Chinese biopsy-confirmed cohort supplies 486 training patients and 54 internal validation/tuning patients, and performance is reported on two external cohorts from Malaysia and India. The s-DNN (354 parameters) achieves external ROC-AUCs of 0.77 and 0.67 in Malaysia and India versus FIB-4's 0.75 and 0.60; comparisons are also made to TabPFN and fine-tuned GPT-4o.
Significance. If the modest AUC gains prove robust, the work could support a low-complexity, non-linear enhancement to FIB-4 that requires no additional clinical data collection. External validation on two independent cohorts is a methodological strength, and the emphasis on model compactness (354 trainable parameters) addresses practical deployment constraints in clinical settings.
major comments (3)
- [Abstract/Results] Abstract and Results: The reported AUC improvements are small (Δ=0.02 in Malaysia, Δ=0.07 in India) and the manuscript provides no confidence intervals, DeLong tests, or other statistical comparisons to establish whether these differences exceed sampling variability.
- [Methods/Results] Methods and Results: No cohort-matching statistics, demographic tables, or harmonization details (e.g., NASH CRN vs. other staging systems) are supplied for the Malaysian and Indian external cohorts relative to the Chinese training set. This information is load-bearing for the generalizability claim that the s-DNN's performance reflects architecture rather than unmeasured site or selection effects.
- [Results] Results: While Brier scores (0.18/0.22) and permutation importance (AST and FIB-4 dominant) are reported, the manuscript does not include calibration plots, decision-curve analysis, or sensitivity checks for label noise in the biopsy ground truth across sites.
minor comments (2)
- [Abstract] Abstract: Explicitly state the number of patients in each external cohort (currently only total n=784 is given).
- [Methods] Notation: Clarify whether the s-DNN input includes the pre-computed FIB-4 value as a single feature or its four constituent variables separately.
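The Brier scores raised in the third major comment (0.18 and 0.22) are easier to read against a reference level. The sketch below shows the definition and the score of a constant prediction equal to the cohort prevalence, which is prevalence × (1 − prevalence) and peaks at 0.25. The toy labels and probabilities are hypothetical; the external cohorts' prevalences are not given on this page, so no reference value for them is computed here.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def constant_baseline_brier(prevalence):
    """Brier score obtained by predicting the prevalence for every patient."""
    return prevalence * (1.0 - prevalence)


# Hypothetical toy data: four patients, two with advanced fibrosis.
labels = [1, 0, 0, 1]
probs = [0.8, 0.2, 0.4, 0.6]

print(brier_score(labels, probs))    # about 0.10
print(constant_baseline_brier(0.5))  # 0.25
```

A model is only informative in Brier terms if it scores below the constant baseline for its own cohort, so the per-cohort prevalence would be needed to judge 0.18 and 0.22.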
Simulated Authors' Rebuttal
Thank you for the opportunity to respond to the referee's report. We value the detailed feedback provided, which has helped us identify areas to improve the clarity and rigor of our work. Below, we address each major comment in turn, indicating the revisions we plan to make.
point-by-point responses
- Referee: [Abstract/Results] Abstract and Results: The reported AUC improvements are small (Δ=0.02 in Malaysia, Δ=0.07 in India) and the manuscript provides no confidence intervals, DeLong tests, or other statistical comparisons to establish whether these differences exceed sampling variability.
  Authors: We agree that providing measures of uncertainty and formal statistical comparisons is essential to interpret the modest AUC gains. In the revised manuscript, we will compute and report 95% bootstrap confidence intervals for all AUC values. Additionally, we will perform DeLong tests to assess whether the differences between the s-DNN and FIB-4 (as well as other models) are statistically significant. These results will be added to the Results section and summarized in the Abstract. We note that even small improvements in this clinical context can be meaningful given the low complexity of the model, but we will let the statistical tests speak to their robustness.
  revision: yes
- Referee: [Methods/Results] Methods and Results: No cohort-matching statistics, demographic tables, or harmonization details (e.g., NASH CRN vs. other staging systems) are supplied for the Malaysian and Indian external cohorts relative to the Chinese training set. This information is load-bearing for the generalizability claim that the s-DNN's performance reflects architecture rather than unmeasured site or selection effects.
  Authors: We acknowledge that detailed cohort comparison is important for assessing generalizability. We will add a new table presenting demographic and clinical characteristics (age, sex, BMI, AST, ALT, platelets, FIB-4, fibrosis stage distribution) for all three cohorts. Regarding staging harmonization, all cohorts were biopsy-confirmed MASLD with fibrosis staged using the NASH CRN system or equivalent histological criteria by expert pathologists; we will explicitly state this and any minor differences in the Methods section. This will help clarify that performance differences are more likely attributable to model architecture than site-specific effects.
  revision: yes
- Referee: [Results] Results: While Brier scores (0.18/0.22) and permutation importance (AST and FIB-4 dominant) are reported, the manuscript does not include calibration plots, decision-curve analysis, or sensitivity checks for label noise in the biopsy ground truth across sites.
  Authors: We will enhance the Results by including calibration plots (reliability curves) for the s-DNN and FIB-4 in the supplementary materials to visually assess calibration beyond Brier scores. We will also add decision curve analysis to evaluate clinical utility across different threshold probabilities. For label noise in biopsy ground truth, we will add a discussion noting that while all biopsies were reviewed by experienced hepatopathologists, inter-observer variability is a known limitation in fibrosis staging; however, performing a formal sensitivity analysis would require re-reading of slides or additional annotations not available in the current datasets. We will include this as a limitation.
  revision: partial
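The uncertainty analysis promised in the first response could look roughly like the following: a paired percentile bootstrap for the difference in ROC-AUC between two scores evaluated on the same patients. This is a minimal standard-library sketch, not the authors' procedure; it does not implement the DeLong test, and the function names, defaults, and toy data are all hypothetical.

```python
import random


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_auc_diff(y_true, scores_a, scores_b,
                              n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile CI for AUC(a) - AUC(b) on the same patients."""
    point = roc_auc(y_true, scores_a) - roc_auc(y_true, scores_b)
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        # Resample patients with replacement, keeping both scores paired.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that lost one class entirely
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(roc_auc(y, a) - roc_auc(y, b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper


# Hypothetical toy data: score A separates the classes, score B is uninformative.
y_demo = [0] * 20 + [1] * 20
score_a = list(range(40))
score_b = [1.0] * 40
print(paired_bootstrap_auc_diff(y_demo, score_a, score_b, n_boot=200))
```

Resampling patients rather than scores keeps the two models' predictions paired, which matters because both are evaluated on the same cohort. An interval for the difference that excludes zero would support the claim that the gain exceeds sampling variability.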
Circularity Check
No significant circularity; standard external validation on held-out cohorts.
full rationale
The paper trains shallow-deep NN, TabPFN, and GPT-4o variants on the Chinese cohort (486 train + 54 internal val) and reports performance exclusively on the independent Malaysian and Indian external cohorts. No equations, fitted parameters, or self-citations reduce the reported AUCs, Brier scores, or permutation importances to quantities computed on the test data itself. The derivation chain consists of ordinary supervised learning followed by external evaluation; the central claim that the 354-parameter s-DNN modestly improves on FIB-4 therefore rests on empirical generalization rather than definitional or self-referential reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- s-DNN trainable parameters (354, fitted on the Chinese training split)
axioms (1)
- domain assumption: External cohorts are representative and biopsy labels are reliable
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We evaluated whether machine-learning-enhanced non-invasive testing (MLE-NIT) can improve advanced fibrosis detection while preserving this FIB-4 variable space... s-DNN with L=3 and (d(1),d(2),d(3))=(17,5,23), yielding 354 trainable parameters."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "External validation performance across the Malaysian and Indian cohorts... s-DNN achieved 0.77 and 0.67"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.