Automatic Construction of Clinical Scoring Systems with LLM Agents
Pith reviewed 2026-05-25 07:21 UTC · model grok-4.3
The pith
LLM agents can generate clinical scoring systems that outperform prior methods and match flexible models on eight tasks while exceeding established guidelines on external validation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentScore performs semantically guided optimization in the space of unit-weighted clinical checklists by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
What carries the argument
AgentScore, an LLM-driven proposal step followed by a deterministic verification-and-selection loop that filters rules for validity and deployability.
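The propose-then-verify pattern described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the minimum-support check, the greedy AUROC-based selection, the toy rows, and the hand-written candidate rules (standing in for LLM proposals) are all assumptions made for the example, and selection here reuses the same rows as verification for brevity.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]
Rule = Tuple[str, Callable[[Row], bool]]  # (readable name, binary predicate)


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def checklist_score(rules: List[Rule], row: Row) -> int:
    """Unit-weighted score: one point per satisfied rule."""
    return sum(1 for _, fires in rules if fires(row))


def verify(rule: Rule, rows: List[Row], min_support: float = 0.1) -> bool:
    """Deterministic filter (assumed): reject rules that fire almost never or almost always."""
    rate = sum(rule[1](r) for r in rows) / len(rows)
    return min_support <= rate <= 1.0 - min_support


def select(candidates: List[Rule], rows: List[Row], labels: List[int],
           max_rules: int = 5) -> Tuple[List[Rule], float]:
    """Greedy forward selection (assumed): add the verified rule that most improves AUROC."""
    pool = [r for r in candidates if verify(r, rows)]
    chosen: List[Rule] = []
    best = 0.5  # AUROC of the empty checklist, where every score is tied
    while pool and len(chosen) < max_rules:
        scored = [(auroc([checklist_score(chosen + [r], x) for x in rows], labels), i)
                  for i, r in enumerate(pool)]
        top_auc, top_i = max(scored)
        if top_auc <= best:
            break  # no remaining rule improves discrimination
        best = top_auc
        chosen.append(pool.pop(top_i))
    return chosen, best


# Hypothetical candidates standing in for LLM proposals.
candidates: List[Rule] = [
    ("age > 65", lambda r: r["age"] > 65),
    ("lactate > 2.0", lambda r: r["lactate"] > 2.0),
    ("age > 0", lambda r: r["age"] > 0),  # fires on everyone, so verify() rejects it
]
rows: List[Row] = [
    {"age": 70, "lactate": 3.0}, {"age": 80, "lactate": 1.0}, {"age": 50, "lactate": 4.0},
    {"age": 40, "lactate": 1.0}, {"age": 30, "lactate": 1.5}, {"age": 66, "lactate": 1.2},
]
labels = [1, 1, 1, 0, 0, 0]

chosen, auc = select(candidates, rows, labels)
print([name for name, _ in chosen], round(auc, 3))
```

The point of the split is that the proposal step can be arbitrarily creative, while everything that decides whether a rule survives is a plain, repeatable computation on data.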
If this is right
- Clinical scoring systems can be produced automatically for many prediction tasks instead of requiring manual expert rule design.
- Strong performance is possible even when scores are restricted to simple unit-weighted checklists of binary rules.
- Automatically generated scores can exceed the discrimination of existing manual guidelines on externally validated tasks.
- The structural limits of deployable guidelines need not prevent them from reaching accuracy levels close to less constrained interpretable models.
Where Pith is reading between the lines
- The same proposal-and-verification pattern could be applied to create interpretable rule sets in non-clinical areas such as credit risk or safety checklists.
- Even with the verification loop, proposed rules should be audited for unintended patterns that might reflect the LLM's pretraining data rather than the clinical dataset.
- Future tests could measure whether the generated scores actually change clinician behavior or patient outcomes rather than only discrimination metrics.
Load-bearing premise
Rules proposed by the LLM and then filtered by the deterministic loop will produce scores whose performance holds up on new data without being distorted by the LLM's own training biases or selection artifacts.
What would settle it
Finding that AgentScore scores produce lower AUROC than established guideline scores when tested on a fresh external dataset drawn from a different hospital system or time period.
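A test of this kind could be run as a paired comparison on the external cohort. The sketch below assumes a percentile bootstrap over patients and uses made-up labels and scores; neither the resampling scheme nor the data comes from the paper.

```python
import random
from typing import List, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(new: List[float], old: List[float], labels: List[int],
                           n_boot: int = 2000, seed: int = 0) -> Tuple[float, float, float]:
    """Point estimate and 95% percentile interval for AUROC(new) - AUROC(old)."""
    rng = random.Random(seed)
    n = len(labels)
    point = auroc(new, labels) - auroc(old, labels)
    deltas: List[float] = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined when a resample lacks one class
        deltas.append(auroc([new[i] for i in idx], ys) - auroc([old[i] for i in idx], ys))
    deltas.sort()
    lo = deltas[n_boot * 25 // 1000]
    hi = deltas[n_boot * 975 // 1000 - 1]
    return point, lo, hi


# Hypothetical external cohort: outcome labels, a generated score, a guideline score.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
generated = [3, 2, 2, 1, 1, 0, 0, 1]
guideline = [2, 1, 0, 1, 1, 0, 1, 0]

point, lo, hi = paired_bootstrap_delta(generated, guideline, labels)
print(f"delta AUROC = {point:.4f}, 95% interval = [{lo:.4f}, {hi:.4f}]")
```

A negative difference whose interval excludes zero on such a cohort would be the kind of evidence described above; an interval that straddles zero would leave the question open.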
Original abstract
Modern clinical practice relies on evidence-based guidelines implemented as compact scoring systems composed of a small number of interpretable decision rules. While machine-learning models achieve strong performance, many fail to translate into routine clinical use due to misalignment with workflow constraints such as memorability, auditability, and bedside execution. We argue that this gap arises not from insufficient predictive power, but from optimizing over model classes that are incompatible with guideline deployment. Deployable guidelines often take the form of unit-weighted clinical checklists, formed by thresholding the sum of binary rules, but learning such scores requires searching an exponentially large discrete space of possible rule sets. We introduce AgentScore, which performs semantically guided optimization in this space by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
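To make the model class concrete, here is a minimal sketch of a unit-weighted checklist (a thresholded sum of binary rules) together with a count of how quickly the space of rule subsets grows. The rules, cut-offs, patient values, and the `Checklist` and `count_rule_sets` names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import comb
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass
class Checklist:
    rules: List[Callable[[Row], bool]]  # binary rules, each worth exactly one point
    threshold: int                      # flag the patient when score >= threshold

    def score(self, row: Row) -> int:
        return sum(1 for rule in self.rules if rule(row))

    def predict(self, row: Row) -> bool:
        return self.score(row) >= self.threshold


def count_rule_sets(n_candidates: int, max_rules: int) -> int:
    """Number of rule subsets containing at most max_rules rules."""
    return sum(comb(n_candidates, k) for k in range(max_rules + 1))


# Hypothetical rules and cut-offs, not taken from any guideline.
checklist = Checklist(
    rules=[
        lambda r: r["age"] > 65,
        lambda r: r["lactate"] > 2.0,
        lambda r: r["sbp"] < 90,
    ],
    threshold=2,
)
patient = {"age": 72, "lactate": 2.4, "sbp": 110}

print(checklist.score(patient), checklist.predict(patient))
print(count_rule_sets(20, 5), 2 ** 20)
```

Even with only 20 candidate rules there are 2^20 possible subsets, which is why the abstract frames learning such scores as a search over an exponentially large discrete space.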
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AgentScore, a method that uses LLM agents to propose candidate rules for clinical scoring systems and applies a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints such as unit weighting and interpretability. It claims that across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite stronger structural constraints. On two additional externally validated tasks, it reports higher discrimination than established guideline-based scores.
Significance. If the empirical results hold under rigorous scrutiny, this work could advance automated construction of deployable clinical guidelines by using LLMs for semantic search over an exponentially large discrete rule space while maintaining statistical controls. The combination of LLM proposal with deterministic verification addresses a key practical barrier in translating ML to bedside tools.
major comments (2)
- [Abstract] Abstract: The central claim that AgentScore 'outperforms existing score-generation methods' across eight tasks and achieves 'higher discrimination than established guideline-based scores' on two external tasks is load-bearing, yet the abstract supplies no experimental details, baseline definitions, data splits, statistical tests, or multiple-comparison corrections. This prevents assessment of whether the reported gains are robust.
- [Methods] Methods (verification-and-selection loop): The approach relies on external data verification rather than reducing performance to quantities defined by fitted parameters internal to the model. Without explicit controls for LLM training biases or post-hoc selection effects from the proposal step, the generalization claims on external tasks rest on an assumption that requires concrete validation protocols.
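As a point of reference for the multiple-comparison concern in the first comment, a Bonferroni adjustment over eight per-task tests can be sketched as follows. The p-values are invented for the example and do not come from the manuscript.

```python
from typing import List, Tuple


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[Tuple[float, bool]]:
    """Return (adjusted p-value, reject?) per test, with adjusted p = min(1, p * m)."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p_adj, p_adj <= alpha) for p_adj in adjusted]


# Hypothetical raw p-values, one per clinical task.
raw = [0.001, 0.004, 0.006, 0.01, 0.02, 0.03, 0.2, 0.5]

for p, (p_adj, reject) in zip(raw, bonferroni(raw)):
    print(f"raw={p:.3f} adjusted={p_adj:.3f} reject={reject}")
```

With eight comparisons, a raw p-value has to fall at or below 0.05 / 8 = 0.00625 to survive, so several results that look significant in isolation would not count after correction.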
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and have made revisions to strengthen the manuscript's clarity on experimental details and validation protocols.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that AgentScore 'outperforms existing score-generation methods' across eight tasks and achieves 'higher discrimination than established guideline-based scores' on two external tasks is load-bearing, yet the abstract supplies no experimental details, baseline definitions, data splits, statistical tests, or multiple-comparison corrections. This prevents assessment of whether the reported gains are robust.
  Authors: We agree that the abstract would benefit from additional context to support the central claims. In the revised version, we have expanded the abstract to briefly specify the eight clinical tasks, the score-generation baselines compared, the use of internal cross-validation and external test sets, and the reporting of AUROC with statistical significance testing. Full details on data splits, baseline definitions, and multiple-comparison corrections (Bonferroni-adjusted) remain in the Methods and Results sections, with a new sentence in the abstract directing readers there. We believe this addresses the concern without exceeding abstract length limits.
  Revision: yes
- Referee: [Methods] Methods (verification-and-selection loop): The approach relies on external data verification rather than reducing performance to quantities defined by fitted parameters internal to the model. Without explicit controls for LLM training biases or post-hoc selection effects from the proposal step, the generalization claims on external tasks rest on an assumption that requires concrete validation protocols.
  Authors: The verification-and-selection loop uses a three-way data split (proposal on training, selection on validation, evaluation on held-out test) to control post-hoc selection effects, with all final scores required to meet pre-specified statistical thresholds on the validation set before external testing. LLM proposal is treated as a semantic prior only; no LLM parameters are fitted to the clinical data. We have added an explicit subsection in Methods detailing the partitioning protocol, a sensitivity analysis removing the top-proposed rules, and a note that LLM training data overlap cannot be fully audited but is mitigated by the deterministic verification step. These additions provide the requested concrete validation protocols.
  Revision: partial
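The three-way partition and validation gate described in the second response can be sketched as below. The 60/20/20 fractions, the seed, the 0.70 cut-off, and the function names are assumptions for illustration; the rebuttal does not state the actual values.

```python
import random
from typing import List, Tuple


def three_way_split(n: int, frac_train: float = 0.6, frac_val: float = 0.2,
                    seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle row indices once, then cut them into train / validation / test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded, so the partition is reproducible
    n_train = round(n * frac_train)
    n_val = round(n * frac_val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def may_proceed_to_test(val_auroc: float, min_val_auroc: float) -> bool:
    """Gate fixed before any test data is seen: weak scores never reach the test set."""
    return val_auroc >= min_val_auroc


train, val, test = three_way_split(100)
print(len(train), len(val), len(test))

# Proposal would use `train`, selection would use `val`, and `test` is touched
# only for scores that clear the pre-specified validation threshold.
print(may_proceed_to_test(val_auroc=0.74, min_val_auroc=0.70))
```

Keeping the threshold fixed in advance is what prevents the test set from being used, even indirectly, to choose among candidate scores.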
Circularity Check
No significant circularity
The paper's central method uses LLM agents to propose candidate rules followed by a deterministic, data-grounded verification-and-selection loop. Claims of outperformance are grounded in empirical AUROC evaluations on eight tasks plus external validation, not by construction from fitted parameters or self-referential definitions. No equations, ansatzes, or self-citations are shown that reduce results to inputs. The architecture separates proposal (LLM) from verification (external data), making the derivation self-contained against benchmarks.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
- Learning Interpretable Point-Based Clinical Risk Scores via Direct Optimization: Develops greedy optimization algorithms for directly learning optimal integer-weighted clinical risk scores, applied to predict post-discharge mortality in a large EHR cohort with a supporting simulation study.
Reference graph
Works this paper leans on
[21]
demonstrate a distinct but equally important form of impact in emergency medicine, where a conservative binary checklist enables the safe exclusion of fracture and avoids unnecessary imaging, reducing cost and patient burden without increasing missed injuries. For the Ottawa Ankle Rules, the checklist is operationalized as a conservative OR-rule. 16 Agent...
-
[22]
MIMIC-IV (v3.1) (Johnson et al., 2023): A de-identified electronic health record (EHR) database from Beth Israel Deaconess Medical Center containing over 400,000 hospital admissions.
-
[23]
eICU Collaborative Research Database (v2.0) (Pollard et al., 2018): A multicenter critical care database comprising ICU stays from 208 hospitals across the United States.
-
[24]
UK Cystic Fibrosis (CF) Registry: A national registry containing annual longitudinal follow-up records for individuals with cystic fibrosis in the United Kingdom
-
[25]
Canadian Cystic Fibrosis Registry: A national population-based registry used for external validation of CF mortality prediction
-
[26]
PhysioNet Challenge 2012 (Silva et al., 2012): A publicly available ICU mortality benchmark comprising 8,000 patient episodes from two hospitals.
Task definitions. Table 8 summarizes outcome definitions, prediction horizons, index times, and inclusion criteria. We provide additional clarifications below to ensure precise reproducibility. Observation windows...
-
[27]
The “Generation–Selection” Gap: State-of-the-art interpretable solvers such as RiskSLIM and FasterRisk act as selectors, requiring a pre-computed feature matrix $X \in \{0,1\}^{N \times |R_{\mathrm{univ}}|}$. They cannot generate semantic rules dynamically; any clinically meaningful derived constructs (ratios, trends, shallow logic) must be manually engineered and materialized as col...
-
[28]
Failure of Continuous Relaxation (e.g., Lasso): Relaxing $w \in \{0,1\}^{|R_{\mathrm{univ}}|}$ to continuous weights $w \in \mathbb{R}^{|R_{\mathrm{univ}}|}$ introduces a substantial integrality gap: continuous relaxations induce fractional solutions, and rounding them can change the induced decision boundary and utility relative to the discrete optimum. In checklist learning, where both the score and the ...
-
[29]
Failure of Classical Heuristics (Genetic Algorithms, SA): Standard heuristic searches struggle with the semantic structure of the rule space:
• Undefined Metric Space: Crossover operators in Genetic Algorithms require a meaningful metric space. It is unclear how to interpolate between “Age > 65” and “Lactate < 2.0”.
• Sparse Fitness Landscape: A random mutation ...
-
[30]
Primitive Rules: With $T = 20$ quantile thresholds and range constraints, a crude count gives $|R_{\mathrm{prim}}| \approx p \times (2T + T^2) \approx 50 \times 440 \approx 2.2 \times 10^4$.
-
[31]
Compositional Rules: Allowing depth-1 logical operators (AND/OR) between pairs of primitives yields, up to constants, $|R_{\mathrm{comp}}| \approx 2 \cdot |R_{\mathrm{prim}}|^2 \approx O(|R_{\mathrm{prim}}|^2) \approx (2.2 \times 10^4)^2 \approx 4.8 \times 10^8$.
-
[32]
Additional Variants (Temporal + Ratios): Introducing simple temporal summaries (e.g., $W = 4$ windows $\times$ 3 stats $= 12$ variants) and a restricted set of arithmetic ratios/differences over variable pairs ($\binom{50}{2} \approx 1225$) increases the candidate universe by large multiplicative factors. An order-of-magnitude approximation is $|R_{\mathrm{univ}}| \approx |R_{\mathrm{comp}}| \times (1 + 12_{\text{temporal}}) \times (1$ ...
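The counting argument in the last three entries can be checked with a few lines of arithmetic. The value p = 50 is read off the "50 × 440" product in the excerpt, and the variable names are ours; the final product for the full universe is not computed because the excerpt's formula is truncated.

```python
from math import comb

p = 50  # number of base variables, as implied by the 50 x 440 product
T = 20  # quantile thresholds per variable

n_prim = p * (2 * T + T ** 2)  # primitive rules: 50 * (40 + 400)
n_comp = n_prim ** 2           # pairwise compositions, constant factors dropped
n_temporal = 4 * 3             # W = 4 windows x 3 statistics
n_pairs = comb(p, 2)           # unordered variable pairs for ratios/differences

print(n_prim)            # primitive-rule count
print(f"{n_comp:.2e}")   # compositional-rule count, order of magnitude
print(n_temporal, n_pairs)
```

The exact figures are 22,000 primitives, 4.84 × 10^8 pairwise compositions (which the excerpt rounds to 4.8 × 10^8), 12 temporal variants, and exactly 1,225 variable pairs.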