Automatic Construction of Clinical Scoring Systems with LLM Agents

Christopher Chiu; Mihaela van der Schaar; Silas Ruhrberg Estévez

arxiv: 2601.22324 · v2 · pith:S3ORUIZ5new · submitted 2026-01-29 · 💻 cs.LG · cs.MA

Pith reviewed 2026-05-25 07:21 UTC · model grok-4.3

classification 💻 cs.LG cs.MA

keywords: clinical scoring systems, LLM agents, interpretable models, clinical prediction, automated rule generation, unit-weighted checklists, guideline development

The pith

LLM agents can generate clinical scoring systems that outperform prior score-generation methods and reach AUROC comparable to more flexible interpretable models on eight tasks, while exceeding established guideline-based scores on two externally validated tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AgentScore as a way to build simple clinical scoring systems by having LLMs propose candidate rules and then using a fixed verification loop to pick statistically sound ones that meet bedside constraints. Traditional machine learning often produces models too complex for routine use, while manual guideline scores are limited by the difficulty of searching all possible rule combinations. The method claims to close this gap by staying within unit-weighted checklists yet still delivering strong prediction. If the approach works, it would allow automatic creation of deployable scores that preserve interpretability without sacrificing accuracy.

Core claim

AgentScore performs semantically guided optimization in the space of unit-weighted clinical checklists by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
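To make the model class concrete, here is a minimal sketch of a unit-weighted checklist: each binary rule contributes one point, and the decision comes from thresholding the sum. The rules, field names, and threshold are hypothetical illustrations and are not taken from the paper.

```python
from typing import Callable, Dict, List

Patient = Dict[str, float]
Rule = Callable[[Patient], bool]

# Hypothetical rules for illustration only; not rules reported by the paper.
RULES: List[Rule] = [
    lambda p: p["age"] > 65,
    lambda p: p["lactate"] >= 2.0,
    lambda p: p["resp_rate"] >= 30,
]


def checklist_score(patient: Patient, rules: List[Rule]) -> int:
    """Count the satisfied binary rules; every rule carries weight 1."""
    return sum(1 for rule in rules if rule(patient))


def checklist_flag(patient: Patient, rules: List[Rule], threshold: int) -> bool:
    """Flag the patient when the unit-weighted score reaches the threshold."""
    return checklist_score(patient, rules) >= threshold


# Hypothetical patient: satisfies the age and respiratory-rate rules only.
patient = {"age": 72, "lactate": 1.4, "resp_rate": 32}
score = checklist_score(patient, RULES)
flag = checklist_flag(patient, RULES, threshold=2)
```

Because every weight is fixed at one, the only free choices are which rules enter the list and where the threshold sits, which is what makes the search discrete.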

What carries the argument

AgentScore, an LLM-driven proposal step followed by a deterministic verification-and-selection loop that filters rules for validity and deployability.
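The propose-then-verify pattern can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `proposal_stub` stands in for the LLM proposal step with hard-coded candidates, the admissibility test (a minimum risk difference) is a stand-in for whatever statistical criteria the paper uses, and only the cap of six rules mirrors the stated maximum checklist size.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]
Rule = Tuple[str, Callable[[Row], bool]]  # (readable name, predicate)


def proposal_stub() -> List[Rule]:
    """Hypothetical stand-in for the LLM proposal step."""
    return [
        ("age > 65", lambda r: r["age"] > 65),
        ("lactate >= 2.0", lambda r: r["lactate"] >= 2.0),
        ("sbp < 90", lambda r: r["sbp"] < 90),
    ]


def risk_difference(rule: Rule, rows: Sequence[Row], labels: Sequence[int]) -> float:
    """Outcome rate where the rule fires minus the rate where it does not."""
    fired = [y for r, y in zip(rows, labels) if rule[1](r)]
    quiet = [y for r, y in zip(rows, labels) if not rule[1](r)]
    if not fired or not quiet:
        return 0.0  # a rule that never or always fires carries no signal
    return sum(fired) / len(fired) - sum(quiet) / len(quiet)


def verify_and_select(candidates: List[Rule], rows: Sequence[Row],
                      labels: Sequence[int], min_gap: float = 0.1,
                      max_rules: int = 6) -> List[str]:
    """Deterministic filter: keep admissible rules, strongest first, capped in size."""
    scored = [(risk_difference(c, rows, labels), c) for c in candidates]
    admissible = [(gap, c) for gap, c in scored if gap >= min_gap]
    admissible.sort(key=lambda pair: pair[0], reverse=True)
    return [c[0] for _, c in admissible[:max_rules]]


# Tiny hypothetical dataset, for illustration only.
rows = [
    {"age": 80, "lactate": 3.0, "sbp": 120},
    {"age": 70, "lactate": 2.5, "sbp": 110},
    {"age": 75, "lactate": 1.0, "sbp": 115},
    {"age": 40, "lactate": 1.2, "sbp": 118},
    {"age": 50, "lactate": 0.9, "sbp": 85},
    {"age": 30, "lactate": 1.1, "sbp": 125},
]
labels = [1, 1, 0, 0, 0, 0]
selected = verify_and_select(proposal_stub(), rows, labels)
```

The point of the split is that the proposer may be as creative as it likes, while acceptance depends only on a fixed, reproducible check against data.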

If this is right

Clinical scoring systems can be produced automatically for many prediction tasks instead of requiring manual expert rule design.
Strong performance is possible even when scores are restricted to simple unit-weighted checklists of binary rules.
Automatically generated scores can exceed the discrimination of existing manual guidelines on externally validated tasks.
The structural limits of deployable guidelines need not prevent them from reaching accuracy levels close to less constrained interpretable models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same proposal-and-verification pattern could be applied to create interpretable rule sets in non-clinical areas such as credit risk or safety checklists.
Even with the verification loop, proposed rules should be audited for unintended patterns that might reflect the LLM's pretraining data rather than the clinical dataset.
Future tests could measure whether the generated scores actually change clinician behavior or patient outcomes rather than only discrimination metrics.

Load-bearing premise

Rules proposed by the LLM and then filtered by the deterministic loop will produce scores whose performance holds up on new data without being distorted by the LLM's own training biases or selection artifacts.

What would settle it

Finding that scores generated by AgentScore yield lower AUROC than established guideline scores when tested on a fresh external dataset drawn from a different hospital system or time period.
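Such a head-to-head comparison reduces to computing AUROC for two integer-valued scores on the same external cohort. The sketch below uses the pairwise (Mann-Whitney) definition, which handles the heavy ties that discrete checklist scores produce. The labels and score values are made up for illustration.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical external cohort: outcome labels and two competing checklist scores.
labels = [1, 1, 0, 0, 0]
generated_score = [3, 2, 2, 1, 0]
guideline_score = [2, 1, 2, 1, 0]

auc_generated = auroc(generated_score, labels)
auc_guideline = auroc(guideline_score, labels)
```

The quadratic loop is fine for a sketch; a rank-based implementation would be preferable on large cohorts.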

Figures

Figures reproduced from arXiv: 2601.22324 by Christopher Chiu, Mihaela van der Schaar, Silas Ruhrberg Estévez.

**Figure 1.** Clinical scoring systems: Guideline artifacts are compact, explicit checklists intended for reliable manual use. In parallel, increasingly complex machine learning models have achieved strong performance on clinical prediction tasks (Takita et al., 2025; Killock, 2020; Shickel et al., 2018). However, even when accurate and ostensibly interpretable, many such models remain poorly matched to guideline d…

**Figure 2.** Overview of AgentScore. An LLM-based proposal agent generates candidate rules from a dataset description and tool-mediated aggregate statistics; it never receives patient-level records. Proposed rules are screened by a deterministic validation module enforcing statistical performance and grammar-level deployability constraints (e.g., complexity limits, unit weights). Statistically admissible rules are revi…

**Figure 3.** External validation against guidelines. AgentScore outperforms clinical guidelines in AUROC (mean ± std over 5 seeds; guidelines deterministic). 5.3. Practical Deployability: We conducted a structured expert review of score deployability with a panel of N = 18 practicing clinicians (89% with ≥6 years of clinical experience) from six countries. Participants evaluated four representative AgentScore checkli…

**Figure 4.** Effect of LLM backbone choice. Performance of AgentScore across different language-model backbones. While GPT-5 attains the strongest results on average, GPT-4o and DeepSeek V3.2 exhibit comparable performance trends, suggesting limited sensitivity to the specific LLM used for rule proposal. The only apparent deviation from strict monotonicity occurs in the MIMIC Lung dataset, where the score bin of zero e…

**Figure 5.** Risk monotonicity across score values. Empirical outcome prevalence (percentage positive) as a function of the discrete AgentScore value. E.4. Effect of guideline size: Throughout the paper, we restrict the maximum number of rules to $M_{\max} = 6$, reflecting a conservative upper bound for checklists that remain easily memorizable and executable at the bedside while achieving competitive predictive performance. …

**Figure 6.** Accuracy–complexity trade-off. AUROC as a function of the maximum number of rules allowed in the checklist. Performance improves monotonically with rule set size, illustrating a clear deployability–accuracy Pareto frontier. Statistical analysis: We evaluate predictive performance using AUROC. For AgentScore, predicted probabilities are computed as $\hat{p} = \text{count} / n_{\text{rules}}$, where count denotes the number of satis…

**Figure 7.** Rule type diversity across cross-validation. Distribution of rule families (as defined by the rule grammar) appearing in the final AgentScore checklists across five folds per dataset. Bars report the frequency with which each rule type is used, showing that learned scores draw on multiple rule families rather than collapsing to a single template. E.6. Clinician review of scoring system deployability: We con…

**Figure 8.** Clinician evaluation of scoring system deployability (N = 18). Top row: Experience distribution (Q1) and Likert-scale responses for trust (Q2–Q3) and deployability preferences (Q4–Q6). Bottom row: Aggregated pairwise preferences across 4 clinical tasks (Q7–Q9; 72 judgments each) and overall model preference (Q10). Model A: FasterRisk; Model B: AgentScore. E.7. Wall-Clock Time: We additionally report wall-cl…

Original abstract

Modern clinical practice relies on evidence-based guidelines implemented as compact scoring systems composed of a small number of interpretable decision rules. While machine-learning models achieve strong performance, many fail to translate into routine clinical use due to misalignment with workflow constraints such as memorability, auditability, and bedside execution. We argue that this gap arises not from insufficient predictive power, but from optimizing over model classes that are incompatible with guideline deployment. Deployable guidelines often take the form of unit-weighted clinical checklists, formed by thresholding the sum of binary rules, but learning such scores requires searching an exponentially large discrete space of possible rule sets. We introduce AgentScore, which performs semantically guided optimization in this space by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AgentScore uses LLMs to propose rules for unit-weighted clinical scores then verifies them on data, but the abstract supplies almost no experimental detail to back the performance claims.

The core idea is straightforward: LLMs suggest candidate binary rules for simple additive scores, then a deterministic loop checks them against data for validity and picks the best set under the constraints of clinical use. That framing for searching the discrete space of unit-weighted checklists looks new relative to earlier automated rule learning work. The paper correctly identifies why many high-performing ML models never reach guidelines: memorability, auditability, and bedside execution matter more than raw AUROC in practice. It also shows awareness that the search space is exponential, so pure enumeration fails and semantic guidance from the LLM is a plausible workaround. The external validation on two tasks is the part that could matter most if the numbers hold.

The main weakness is that the abstract states outperformance across eight tasks and superiority to existing guideline scores without describing the baselines, the exact verification procedure, data splits, or any statistical testing. Without those, it is impossible to judge whether the verification loop actually prevents overfitting or post-hoc selection effects from the LLM proposals. The method description stays high-level, so circularity or leakage risks cannot be checked.

This is aimed at researchers building deployable clinical tools rather than pure predictive modeling. A reader already working on interpretable models in medicine could extract the architecture and try it, but only if the full paper supplies reproducible methods and results. The problem is real and the approach is reasonable, so it deserves peer review even though the current write-up is too thin on evidence to assess the central claims.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces AgentScore, a method that uses LLM agents to propose candidate rules for clinical scoring systems and applies a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints such as unit weighting and interpretability. It claims that across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite stronger structural constraints. On two additional externally validated tasks, it reports higher discrimination than established guideline-based scores.

Significance. If the empirical results hold under rigorous scrutiny, this work could advance automated construction of deployable clinical guidelines by using LLMs for semantic search over an exponentially large discrete rule space while maintaining statistical controls. The combination of LLM proposal with deterministic verification addresses a key practical barrier in translating ML to bedside tools.

major comments (2)

[Abstract] Abstract: The central claim that AgentScore 'outperforms existing score-generation methods' across eight tasks and achieves 'higher discrimination than established guideline-based scores' on two external tasks is load-bearing, yet the abstract supplies no experimental details, baseline definitions, data splits, statistical tests, or multiple-comparison corrections. This prevents assessment of whether the reported gains are robust.
[Methods] Methods (verification-and-selection loop): The approach relies on external data verification rather than reducing performance to quantities defined by fitted parameters internal to the model. Without explicit controls for LLM training biases or post-hoc selection effects from the proposal step, the generalization claims on external tasks rest on an assumption that requires concrete validation protocols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have made revisions to strengthen the manuscript's clarity on experimental details and validation protocols.

Referee: [Abstract] Abstract: The central claim that AgentScore 'outperforms existing score-generation methods' across eight tasks and achieves 'higher discrimination than established guideline-based scores' on two external tasks is load-bearing, yet the abstract supplies no experimental details, baseline definitions, data splits, statistical tests, or multiple-comparison corrections. This prevents assessment of whether the reported gains are robust.

Authors: We agree that the abstract would benefit from additional context to support the central claims. In the revised version, we have expanded the abstract to briefly specify the eight clinical tasks, the score-generation baselines compared, the use of internal cross-validation and external test sets, and the reporting of AUROC with statistical significance testing. Full details on data splits, baseline definitions, and multiple-comparison corrections (Bonferroni-adjusted) remain in the Methods and Results sections, with a new sentence in the abstract directing readers there. We believe this addresses the concern without exceeding abstract length limits.

revision: yes

Referee: [Methods] Methods (verification-and-selection loop): The approach relies on external data verification rather than reducing performance to quantities defined by fitted parameters internal to the model. Without explicit controls for LLM training biases or post-hoc selection effects from the proposal step, the generalization claims on external tasks rest on an assumption that requires concrete validation protocols.

Authors: The verification-and-selection loop uses a three-way data split (proposal on training, selection on validation, evaluation on held-out test) to control post-hoc selection effects, with all final scores required to meet pre-specified statistical thresholds on the validation set before external testing. LLM proposal is treated as a semantic prior only; no LLM parameters are fitted to the clinical data. We have added an explicit subsection in Methods detailing the partitioning protocol, a sensitivity analysis removing the top-proposed rules, and a note that LLM training data overlap cannot be fully audited but is mitigated by the deterministic verification step. These additions provide the requested concrete validation protocols.

revision: partial
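The three-way partition described in this response can be sketched as below. The 60/20/20 fractions, the seed, and the function name are assumptions made for illustration; the response does not state the proportions used.

```python
import random
from typing import List, Tuple


def three_way_split(n: int, frac_train: float = 0.6, frac_val: float = 0.2,
                    seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle row indices once, then cut them into disjoint train/validation/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded so the partition is reproducible
    n_train = round(n * frac_train)
    n_val = round(n * frac_val)
    train = idx[:n_train]                 # rules are proposed using only these rows
    val = idx[n_train:n_train + n_val]    # rules are selected on these rows
    test = idx[n_train + n_val:]          # touched once, for final evaluation
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(100)
```

Keeping selection on a partition the proposer never saw is what limits post-hoc selection effects; the held-out test rows then give an estimate that neither step has influenced.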

Circularity Check

0 steps flagged

No significant circularity

The paper's central method uses LLM agents to propose candidate rules followed by a deterministic, data-grounded verification-and-selection loop. Claims of outperformance are grounded in empirical AUROC evaluations on eight tasks plus external validation, not by construction from fitted parameters or self-referential definitions. No equations, ansatzes, or self-citations are shown that reduce results to inputs. The architecture separates proposal (LLM) from verification (external data), making the derivation self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, preventing identification of specific free parameters, axioms, or invented entities. The approach implicitly assumes LLM proposals are sufficiently diverse and that the verification loop enforces generalizability without introducing selection bias.

pith-pipeline@v0.9.0 · 5720 in / 1082 out tokens · 27364 ms · 2026-05-25T07:21:23.867116+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning Interpretable Point-Based Clinical Risk Scores via Direct Optimization
stat.ME 2026-05 unverdicted novelty 5.0

Develops greedy optimization algorithms for directly learning optimal integer-weighted clinical risk scores, applied to predict post-discharge mortality in a large EHR cohort with a supporting simulation study.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

doi: 10.21037/atm.2019.11

ISSN 2305-5847. doi: 10.21037/atm.2019.11

work page doi:10.21037/atm.2019.11 2019
[2]

2019.11.121

URL http://dx.doi.org/10.21037/atm.2019.11.121. Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.,

work page doi:10.21037/atm 2019
[3]

Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl

URL https://proceedings.neurips.cc/paper_files/paper/2018/file/69386f6bb1dfed68692a24c8686939b9-Paper.pdf. Collins, G. S., Reitsma, J. B., Altman, D. G., and Moons, K. G. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. Annals of Internal Medicine, 162(1):55–63, January 2015...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.3399/bjgp20x708941 2018
[4]

doi: 10.1016/s1088-467x(97)00008-5

ISSN 1088-467X. doi: 10.1016/s1088-467x(97)00008-5. URL http://dx.doi.org/10.1016/S1088-467X(97)00008-5. Dawes, R. M. The robust beauty of improper linear models in decision making. American Psychologist, 34(7):571–582, July 1979. ISSN 0003-066X. doi: 10.1037/0003-066x.34.7.571. URL http://dx.doi.org/10.1037/0003-066X.34.7.571. Desai, N. and Gross, J. S...

work page doi:10.1016/s1088-467x(97 1979
[5]

doi: 10.1016/0002-9343(94)90143-0

ISSN 0002-9343. doi: 10.1016/0002-9343(94)90143-0. URL http://dx.doi.org/10.1016/0002-9343(94)90143-0. D’Agostino, R. B., Vasan, R. S., Pencina, M. J., Wolf, P. A., Cobain, M., Massaro, J. M., and Kannel, W. B. General cardiovascular risk profile for use in primary care: The Framingham Heart Study. Circulation, 117(6):743–753, February 2008. ISSN 1524-4...

work page doi:10.1016/0002-9343(94 2008
[6]

Granger, C

ISBN 9780309164232. Granger, C. B. Predictors of hospital mortality in the global registry of acute coronary events. Archives of Internal Medicine, 163(19):2345, October 2003. ISSN 0003-9926. doi: 10.1001/archinte.163.19.2345. URL http://dx.doi.org/10.1001/archinte.163.19.2345. Hager, P., Jungmann, F., Holland, R., Bhagat, K., Hubrecht, I., Knauer, M., Vi...

work page doi:10.1001/archinte.163.19.2345 2003
[7]

cc/paper_files/paper/2019/file/ac52c626afc10d4075708ac4c778ddfc-Paper

URL https://proceedings.neurips.cc/paper_files/paper/2019/file/ac52c626afc10d4075708ac4c778ddfc-Paper.pdf. Johnson, A. E. W., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T. J., Hao, S., Moody, B., Gow, B., Lehman, L.-w. H., Celi, L. A., and Mark, R. G. MIMIC-IV, a freely accessible electronic health record dataset. Scientific...

work page doi:10.1038/s41597-022-01899-x 2019
[8]

URL http://dx

doi: 10.1214/15-aoas848. URL http://dx.doi.org/10.1214/15-AOAS848. Lim, W. S., van der Eerden, M. M., Laing, R., Boersma, W. G., Karalus, N., Town, G. I., Lewis, S. A., and Macfarlane, J. T. Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study. Thorax, 58(5):377–382, May 2003. ISS...

work page doi:10.1214/15-aoas848 2003
[9]

Liu, T., Huynh, N., and van der Schaar, M

URL https://openreview.net/forum?id=xTYL1J6Xt-z. Liu, T., Huynh, N., and van der Schaar, M. Decision tree induction through LLMs via semantically-aware evolution. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=UyhRtB4hjN. Lundberg, S. M. and Lee, S.-I. A unified approach to interpreti...

work page doi:10.1037/h0043158 2025
[10]

URL http://dx.doi.org/10.1136/bmj-2024-082505

doi: 10.1136/bmj-2024-082505. URL http://dx.doi.org/10.1136/bmj-2024-082505. Nam, J., Kim, K., Oh, S., Tack, J., Kim, J., and Shin, J. Optimized feature generation for tabular data via LLMs with decision tree reasoning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,

work page doi:10.1136/bmj-2024-082505 2024
[11]

Nori, H., Jenkins, S., Koch, P., and Caruana, R

URL https://openreview.net/forum?id=APSBwuMopO. Nori, H., Jenkins, S., Koch, P., and Caruana, R. InterpretML: A unified framework for machine learning interpretability, 2019. URL https://arxiv.org/abs/1909.09223. Olesen, J., Torp-Pedersen, C., Hansen, M., and Lip, G. The value of the CHA2DS2-VASc score for refining stroke risk stratification in patien...

work page doi:10.1160/th12-03-0175 2019
[12]

doi: 10.1016/s0001-2998(78)80013-0

ISSN 0001-2998. doi: 10.1016/s0001-2998(78)80013-0. URL http://dx.doi.org/10.1016/s0001-2998(78)80013-0. Pollard, T. J., Johnson, A. E. W., Raffa, J. D., Celi, L. A., Mark, R. G., and Badawi, O. The eICU collaborative research database, a freely available multi-center database for critical care research. Scientific Data, 5(1), September 2018. ISSN 2052...

work page doi:10.1016/s0001-2998(78 2018
[13]

URL http://dx.doi.org/10.1016/j.jcf.2019.03.002

doi: 10.1016/j.jcf.2019.03.002. URL http://dx.doi.org/10.1016/j.jcf.2019.03.002. Ribeiro, M. T., Singh, S., and Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 1135–1144, New York, NY, USA, 2016. Asso...

work page doi:10.1016/j.jcf.2019.03.002 2019
[14]

doi: 10.1001/jama.1993

ISSN 0098-7484. doi: 10.1001/jama.1993.03500090063034. URL http://dx.doi.org/10.1001/jama.1993.03500090063034. Shickel, B., Tighe, P. J., Bihorac, A., and Rashidi, P. Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE Journal of Biomedical and Health Informatics, 22(5):1589–1604, Septemb...

work page doi:10.1001/jama.1993 1993
[15]

URL http://dx.doi.org/10.4103/0970-1591.91438

doi: 10.4103/0970-1591.91438. URL http://dx.doi.org/10.4103/0970-1591.91438. Takita, H., Kabata, D., Walston, S. L., Tatekawa, H., Saito, K., Tsujimoto, Y., Miki, Y., and Ueda, D. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and p...

work page doi:10.4103/0970-1591.91438 2025
[16]

doi: 10.1016/s0140-6736(74)91639-0

ISSN 0140-6736. doi: 10.1016/s0140-6736(74)91639-0. URL http://dx.doi.org/10.1016/s0140-6736(74)91639-0. Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1):44–56, January 2019. ISSN 1546-170X. doi: 10.1038/s41591-018-0300-7. URL http://dx.doi.org/10.1038/s41591-018-0300-7. Ustun, B. and R...

work page doi:10.1016/s0140-6736(74 2019
[17]

doi: 10.1186/cc8204

ISSN 1364-8535. doi: 10.1186/cc8204. URL http://dx.doi.org/10.1186/cc8204. Vincent, J. L., Moreno, R., Takala, J., Willatts, S., De Mendonça, A., Bruining, H., Reinhart, C. K., Suter, P. M., and Thijs, L. G. The SOFA (sepsis-related organ failure assessment) score to describe organ dysfunction/failure: On behalf of the working group on sepsis-relat...

work page doi:10.1186/cc8204
[18]

doi: 10.1007/bf01709751

ISSN 1432-1238. doi: 10.1007/bf01709751. URL http://dx.doi.org/10.1007/BF01709751. Wang, F. The crisis of biomedical foundation models. Journal of Biomedical Informatics, 171:104917, November 2025. ISSN 1532-0464. doi: 10.1016/j.jbi.2025.104917. URL http://dx.doi.org/10.1016/j.jbi.2025.104917. Wasylewicz, A. T. M. and Scheepers-Hoeks, A. M. J. W. Clin...

work page doi:10.1007/bf01709751 2025
[19]

URL http://dx.doi.org/10.7861/clinmed.2022-0435

doi: 10.7861/clinmed.2022-0435. URL http://dx.doi.org/10.7861/clinmed.2022-0435. Wells, P., Anderson, D., Rodger, M., Ginsberg, J., Kearon, C., Gent, M., Turpie, A., Bormanis, J., Weitz, J., Chamberlain, M., Bowie, D., Barnes, D., and Hirsh, J. Derivation of a simple clinical model to categorize patients probability of pulmonary embolism: Increasing ...

work page doi:10.7861/clinmed.2022-0435 2022
[20]

doi: 10.1016/s0140-6736(97)08140-3

ISSN 0140-6736. doi: 10.1016/s0140-6736(97)08140-3. URL http://dx.doi.org/10.1016/S0140-6736(97)08140-3. Wieten, S. Expertise in evidence-based medicine: a tale of three models. Philosophy, Ethics, and Humanities in Medicine, 13(1), February 2018. ISSN 1747-5341. doi: 10.1186/s13010-018-0055-2. URL http://dx.doi.org/10.1186/s13010-018-0055-2. Williams, ...

work page doi:10.1016/s0140-6736(97 2018
[21]

normal vs abnormal

demonstrate a distinct but equally important form of impact in emergency medicine, where a conservative binary checklist enables the safe exclusion of fracture and avoids unnecessary imaging, reducing cost and patient burden without increasing missed injuries. For the Ottawa Ankle Rules, the checklist is operationalized as a conservative OR-rule. 16 Agent...

work page 1956
[22]

MIMIC-IV (v3.1) (Johnson et al., 2023): A de-identified electronic health record (EHR) database from Beth Israel Deaconess Medical Center containing over 400,000 hospital admissions

work page 2023
[23]

eICU Collaborative Research Database (v2.0)

eICU Collaborative Research Database (v2.0) (Pollard et al., 2018): A multicenter critical care database comprising ICU stays from 208 hospitals across the United States.

work page 2018
[24]

UK Cystic Fibrosis (CF) Registry: A national registry containing annual longitudinal follow-up records for individuals with cystic fibrosis in the United Kingdom

work page
[25]

Canadian Cystic Fibrosis Registry: A national population-based registry used for external validation of CF mortality prediction

work page
[26]

Task definitions.Table 8 summarizes outcome definitions, prediction horizons, index times, and inclusion criteria

PhysioNet Challenge 2012 (Silva et al., 2012): A publicly available ICU mortality benchmark comprising 8,000 patient episodes from two hospitals. Task definitions. Table 8 summarizes outcome definitions, prediction horizons, index times, and inclusion criteria. We provide additional clarifications below to ensure precise reproducibility. Observation windows...

work page 2012
[27]

Generation–Selection

The “Generation–Selection” Gap: State-of-the-art interpretable solvers such as RiskSLIM and FasterRisk act as selectors, requiring a pre-computed feature matrix $X \in \{0,1\}^{N \times |R_{\mathrm{univ}}|}$. They cannot generate semantic rules dynamically; any clinically meaningful derived constructs (ratios, trends, shallow logic) must be manually engineered and materialized as col...

work page
[28]

In checklist learning, where both the score and the operating threshold are discrete, such rounding effects are often amplified

Failure of Continuous Relaxation (e.g., Lasso): Relaxing $w \in \{0,1\}^{|R_{\mathrm{univ}}|}$ to continuous weights $w \in \mathbb{R}^{|R_{\mathrm{univ}}|}$ introduces a substantial integrality gap: continuous relaxations induce fractional solutions, and rounding them can change the induced decision boundary and utility relative to the discrete optimum. In checklist learning, where both the score and the ...

work page
[29]

Age>65” and “Lactate<2.0

Failure of Classical Heuristics (Genetic Algorithms, SA): Standard heuristic searches struggle with the semantic structure of the rule space:
• Undefined Metric Space: Crossover operators in Genetic Algorithms require a meaningful metric space. It is unclear how to interpolate between “Age > 65” and “Lactate < 2.0”.
• Sparse Fitness Landscape: A random mutation ...

work page
[30]

Primitive Rules: With $T = 20$ quantile thresholds and range constraints, a crude count gives $|R_{\mathrm{prim}}| \approx p \times (2T + T^2) \approx 50 \times 440 \approx 2.2 \times 10^4$

work page
[31]

Compositional Rules

Compositional Rules: Allowing depth-1 logical operators (AND/OR) between pairs of primitives yields, up to constants, $|R_{\mathrm{comp}}| \approx 2 \cdot |R_{\mathrm{prim}}|^2 \approx O(|R_{\mathrm{prim}}|^2) \approx (2.2 \times 10^4)^2 \approx 4.8 \times 10^8$.

work page
[32]

cold start

Additional Variants (Temporal + Ratios): Introducing simple temporal summaries (e.g., $W = 4$ windows $\times$ 3 stats $= 12$ variants) and a restricted set of arithmetic ratios/differences over variable pairs ($\binom{50}{2} \approx 1225$) increases the candidate universe by large multiplicative factors. An order-of-magnitude approximation is |R_univ| ≈ |R_comp| × (1 + 12 temporal) × (1 ...

work page 1980
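The order-of-magnitude counts quoted in the search-space entries above can be checked with a few lines of arithmetic. This sketch only reproduces the complete figures given in the text ($p = 50$ variables, $T = 20$ thresholds). The variable names are illustrative, and the final universe size is left out because the source formula is truncated.

```python
from math import comb

p, T = 50, 20  # number of variables and quantile thresholds, as quoted in the text

# Primitive rules: p * (2T + T^2) = 50 * 440
r_prim = p * (2 * T + T ** 2)

# Compositional rules, up to constants: |R_prim|^2
r_comp = r_prim ** 2

# Unordered variable pairs available for ratios/differences
pairs = comb(p, 2)

# Temporal variants: 4 windows x 3 summary statistics
temporal_variants = 4 * 3
```

Running it gives 22,000 primitive rules, about 4.84 × 10^8 compositions, 1,225 variable pairs, and 12 temporal variants, matching the rounded values in the text.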

[1] [1]

doi: 10.21037/atm.2019.11

ISSN 2305-5847. doi: 10.21037/atm.2019.11

work page doi:10.21037/atm.2019.11 2019

[2] [2]

2019.11.121

URL http://dx.doi.org/10.21037/atm. 2019.11.121. Chen, R. T. Q., Rubanova, Y ., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equa- tions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.),Advances in Neural Information Process- ing Systems, volume 31. Curran Associates, Inc.,

work page doi:10.21037/atm 2019

[3] [3]

Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl

URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 69386f6bb1dfed68692a24c8686939b9-Paper. pdf. Collins, G. S., Reitsma, J. B., Altman, D. G., and Moons, K. G. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (tripod): The tripod statement.Annals of Internal Medicine, 162(1): 55–63, January 2015...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.3399/bjgp20x708941 2018

[4] [4]

doi: 10.1016/s1088-467x(97) 00008-5

ISSN 1088-467X. doi: 10.1016/s1088-467x(97) 00008-5. URL http://dx.doi.org/10.1016/ S1088-467X(97)00008-5. Dawes, R. M. The robust beauty of improper linear models in decision making.American Psychologist, 34(7):571–582, July 1979. ISSN 0003-066X. doi: 10.1037/0003-066x.34.7.571. URL http://dx.doi. org/10.1037/0003-066X.34.7.571. Desai, N. and Gross, J. S...

work page doi:10.1016/s1088-467x(97 1979

[5] [5]

doi: 10.1016/0002-9343(94) 90143-0

ISSN 0002-9343. doi: 10.1016/0002-9343(94) 90143-0. URL http://dx.doi.org/10.1016/ 0002-9343(94)90143-0. D’Agostino, R. B., Vasan, R. S., Pencina, M. J., Wolf, P. A., Cobain, M., Massaro, J. M., and Kannel, W. B. General cardiovascular risk profile for use in primary care: The framingham heart study.Circulation, 117(6): 743–753, February 2008. ISSN 1524-4...

[6]

ISBN 9780309164232. Granger, C. B. Predictors of hospital mortality in the Global Registry of Acute Coronary Events. Archives of Internal Medicine, 163(19):2345, October 2003. ISSN 0003-9926. doi: 10.1001/archinte.163.19.2345. URL http://dx.doi.org/10.1001/archinte.163.19.2345. Hager, P., Jungmann, F., Holland, R., Bhagat, K., Hubrecht, I., Knauer, M., Vi...

[7]

URL https://proceedings.neurips.cc/paper_files/paper/2019/file/ac52c626afc10d4075708ac4c778ddfc-Paper.pdf. Johnson, A. E. W., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T. J., Hao, S., Moody, B., Gow, B., Lehman, L.-w. H., Celi, L. A., and Mark, R. G. MIMIC-IV, a freely accessible electronic health record dataset. Scientific...

[8]

doi: 10.1214/15-aoas848. URL http://dx.doi.org/10.1214/15-AOAS848. Lim, W. S., van der Eerden, M. M., Laing, R., Boersma, W. G., Karalus, N., Town, G. I., Lewis, S. A., and Macfarlane, J. T. Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study. Thorax, 58(5):377–382, May 2003. ISS...

[9]

URL https://openreview.net/forum?id=xTYL1J6Xt-z. Liu, T., Huynh, N., and van der Schaar, M. Decision tree induction through LLMs via semantically-aware evolution. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=UyhRtB4hjN. Lundberg, S. M. and Lee, S.-I. A unified approach to interpreti...

[10]

doi: 10.1136/bmj-2024-082505. URL http://dx.doi.org/10.1136/bmj-2024-082505. Nam, J., Kim, K., Oh, S., Tack, J., Kim, J., and Shin, J. Optimized feature generation for tabular data via LLMs with decision tree reasoning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,

[11]

URL https://openreview.net/forum?id=APSBwuMopO. Nori, H., Jenkins, S., Koch, P., and Caruana, R. InterpretML: A unified framework for machine learning interpretability, 2019. URL https://arxiv.org/abs/1909.09223. Olesen, J., Torp-Pedersen, C., Hansen, M., and Lip, G. The value of the CHA2DS2-VASc score for refining stroke risk stratification in patien...

[12]

ISSN 0001-2998. doi: 10.1016/s0001-2998(78)80013-0. URL http://dx.doi.org/10.1016/s0001-2998(78)80013-0. Pollard, T. J., Johnson, A. E. W., Raffa, J. D., Celi, L. A., Mark, R. G., and Badawi, O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data, 5(1), September 2018. ISSN 2052...

[13]

doi: 10.1016/j.jcf.2019.03.002. URL http://dx.doi.org/10.1016/j.jcf.2019.03.002. Ribeiro, M. T., Singh, S., and Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 1135–1144, New York, NY, USA, 2016. Asso...

[14]

ISSN 0098-7484. doi: 10.1001/jama.1993.03500090063034. URL http://dx.doi.org/10.1001/jama.1993.03500090063034. Shickel, B., Tighe, P. J., Bihorac, A., and Rashidi, P. Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE Journal of Biomedical and Health Informatics, 22(5): 1589–1604, Septemb...

[15]

doi: 10.4103/0970-1591.91438. URL http://dx.doi.org/10.4103/0970-1591.91438. 13 AgentScore: Autoformulation of Deployable Clinical Scoring Systems Takita, H., Kabata, D., Walston, S. L., Tatekawa, H., Saito, K., Tsujimoto, Y., Miki, Y., and Ueda, D. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and p...

[16]

ISSN 0140-6736. doi: 10.1016/s0140-6736(74)91639-0. URL http://dx.doi.org/10.1016/s0140-6736(74)91639-0. Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1):44–56, January 2019. ISSN 1546-170X. doi: 10.1038/s41591-018-0300-7. URL http://dx.doi.org/10.1038/s41591-018-0300-7. Ustun, B. and R...

[17]

ISSN 1364-8535. doi: 10.1186/cc8204. URL http://dx.doi.org/10.1186/cc8204. Vincent, J. L., Moreno, R., Takala, J., Willatts, S., De Mendonça, A., Bruining, H., Reinhart, C. K., Suter, P. M., and Thijs, L. G. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure: On behalf of the working group on sepsis-relat...

[18]

ISSN 1432-1238. doi: 10.1007/bf01709751. URL http://dx.doi.org/10.1007/BF01709751. Wang, F. The crisis of biomedical foundation models. Journal of Biomedical Informatics, 171:104917, November 2025. ISSN 1532-0464. doi: 10.1016/j.jbi.2025.104917. URL http://dx.doi.org/10.1016/j.jbi.2025.104917. Wasylewicz, A. T. M. and Scheepers-Hoeks, A. M. J. W. Clin...

[19]

doi: 10.7861/clinmed.2022-0435. URL http://dx.doi.org/10.7861/clinmed.2022-0435. Wells, P., Anderson, D., Rodger, M., Ginsberg, J., Kearon, C., Gent, M., Turpie, A., Bormanis, J., Weitz, J., Chamberlain, M., Bowie, D., Barnes, D., and Hirsh, J. Derivation of a simple clinical model to categorize patients probability of pulmonary embolism: Increasing ...

[20]

ISSN 0140-6736. doi: 10.1016/s0140-6736(97)08140-3. URL http://dx.doi.org/10.1016/S0140-6736(97)08140-3. Wieten, S. Expertise in evidence-based medicine: a tale of three models. Philosophy, Ethics, and Humanities in Medicine, 13(1), February 2018. ISSN 1747-5341. doi: 10.1186/s13010-018-0055-2. URL http://dx.doi.org/10.1186/s13010-018-0055-2. Williams, ...

[21]

normal vs abnormal

demonstrate a distinct but equally important form of impact in emergency medicine, where a conservative binary checklist enables the safe exclusion of fracture and avoids unnecessary imaging, reducing cost and patient burden without increasing missed injuries. For the Ottawa Ankle Rules, the checklist is operationalized as a conservative OR-rule. 16 Agent...
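As a minimal sketch of what a conservative OR-rule can look like in code, a unit-weighted checklist with decision threshold 1 flags a patient as soon as any single item is present. The function names and the item names below are hypothetical placeholders chosen for illustration; they are not the published Ottawa Ankle Rules criteria.

```python
from typing import Mapping


def checklist_score(items: Mapping[str, bool]) -> int:
    """Unit-weighted score: each satisfied item contributes exactly one point."""
    return sum(bool(v) for v in items.values())


def or_rule_positive(items: Mapping[str, bool], threshold: int = 1) -> bool:
    """With threshold=1 the checklist reduces to a logical OR over its items."""
    return checklist_score(items) >= threshold


# Hypothetical findings for two patients (placeholder item names).
no_findings = {"item_a": False, "item_b": False, "item_c": False}
one_finding = {"item_a": False, "item_b": True, "item_c": False}

print(or_rule_positive(no_findings))  # no item present, so the rule is negative
print(or_rule_positive(one_finding))  # a single item is enough to be positive
```

Raising `threshold` above 1 would make the same checklist less conservative, since more items would have to be present before a patient is flagged.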

[22]

MIMIC-IV (v3.1) (Johnson et al., 2023): A de-identified electronic health record (EHR) database from Beth Israel Deaconess Medical Center containing over 400,000 hospital admissions.

[23]

eICU Collaborative Research Database (v2.0) (Pollard et al., 2018): A multicenter critical care database comprising ICU stays from 208 hospitals across the United States.

[24]

UK Cystic Fibrosis (CF) Registry: A national registry containing annual longitudinal follow-up records for individuals with cystic fibrosis in the United Kingdom.

[25]

Canadian Cystic Fibrosis Registry: A national population-based registry used for external validation of CF mortality prediction.

[26]

PhysioNet Challenge 2012 (Silva et al., 2012): A publicly available ICU mortality benchmark comprising 8,000 patient episodes from two hospitals.

Task definitions. Table 8 summarizes outcome definitions, prediction horizons, index times, and inclusion criteria. We provide additional clarifications below to ensure precise reproducibility. Observation windows...

[27]

The “Generation–Selection” Gap: State-of-the-art interpretable solvers such as RiskSLIM and FasterRisk act as selectors, requiring a pre-computed feature matrix $X \in \{0,1\}^{N \times |R_{\text{univ}}|}$. They cannot generate semantic rules dynamically; any clinically meaningful derived constructs (ratios, trends, shallow logic) must be manually engineered and materialized as col...
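To make the materialization step concrete, the following is a minimal sketch of how hand-written rules could be turned into the binary matrix that a selector consumes. The rule strings, variable names, cut-offs, and the `materialize` helper are all hypothetical and chosen only for illustration; they are not taken from the paper or from the RiskSLIM or FasterRisk interfaces.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Rule = Callable[[Record], bool]

# Hypothetical hand-engineered rules; names and cut-offs are illustrative only.
rules: Dict[str, Rule] = {
    "age > 65": lambda r: r["age"] > 65,
    "lactate >= 2.0": lambda r: r["lactate"] >= 2.0,
    "bun / creatinine > 20": lambda r: r["bun"] / r["creatinine"] > 20,
}


def materialize(records: List[Record], rules: Dict[str, Rule]) -> List[List[int]]:
    """Evaluate every rule on every record, giving an N x |rules| 0/1 matrix."""
    return [[int(rule(rec)) for rule in rules.values()] for rec in records]


records = [
    {"age": 72, "lactate": 1.4, "bun": 30.0, "creatinine": 1.0},
    {"age": 54, "lactate": 3.1, "bun": 12.0, "creatinine": 1.2},
]
X = materialize(records, rules)
print(X)  # [[1, 0, 1], [0, 1, 0]]
```

Every derived construct, such as the ratio rule, has to be written out as its own column before selection starts, which is why the number of columns grows with the size of the rule universe.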

[28]

Failure of Continuous Relaxation (e.g., Lasso): Relaxing $w \in \{0,1\}^{|R_{\text{univ}}|}$ to continuous weights $w \in \mathbb{R}^{|R_{\text{univ}}|}$ introduces a substantial integrality gap: continuous relaxations induce fractional solutions, and rounding them can change the induced decision boundary and utility relative to the discrete optimum. In checklist learning, where both the score and the ...
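The rounding effect can be illustrated with a toy example. The sketch below assumes a hypothetical fractional solution over three binary rules (the weights and threshold are made up, not fitted to data), treats its decisions as the target, and compares naive rounding against an exhaustive search over unit weights and integer thresholds.

```python
from itertools import product


def predict(weights, threshold, x):
    """Linear threshold rule: flag when the weighted score reaches the threshold."""
    return int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)


# Hypothetical fractional solution of a relaxed problem (assumed, not fitted).
frac_w, frac_thr = [0.6, 0.4, 0.4], 0.7

patients = list(product([0, 1], repeat=3))  # all 8 rule-activation patterns
target = [predict(frac_w, frac_thr, x) for x in patients]


def mismatches(weights, threshold):
    """Number of patterns on which a rule disagrees with the fractional one."""
    return sum(predict(weights, threshold, x) != y for x, y in zip(patients, target))


# Naive rounding keeps only the first rule, whatever integer threshold is used.
rounded_w = [round(w) for w in frac_w]
best_rounded = min(mismatches(rounded_w, k) for k in range(5))

# Exhaustive search over unit weights and integer thresholds.
best_discrete = min(
    (mismatches(list(w), k), w, k)
    for w in product([0, 1], repeat=3)
    for k in range(5)
)

print(rounded_w, best_rounded)  # [1, 0, 0] 2
print(best_discrete)            # (0, (1, 1, 1), 2)
```

In this toy case the rounded checklist disagrees with the fractional rule on two of the eight patterns, while a different unit-weighted checklist (any two of the three rules) reproduces it exactly, so the discrete optimum is not the rounded solution.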

[29]

Failure of Classical Heuristics (Genetic Algorithms, SA): Standard heuristic searches struggle with the semantic structure of the rule space:
• Undefined Metric Space: Crossover operators in Genetic Algorithms require a meaningful metric space. It is unclear how to interpolate between “Age > 65” and “Lactate < 2.0”.
• Sparse Fitness Landscape: A random mutation ...

[30]

Primitive Rules: With $T = 20$ quantile thresholds and range constraints, a crude count gives $|R_{\text{prim}}| \approx p \times (2T + T^2) \approx 50 \times 440 \approx 2.2 \times 10^4$.

[31]

Compositional Rules: Allowing depth-1 logical operators (AND/OR) between pairs of primitives yields, up to constants, $|R_{\text{comp}}| \approx 2 \cdot \binom{|R_{\text{prim}}|}{2} \approx O(|R_{\text{prim}}|^2) \approx (2.2 \times 10^4)^2 \approx 4.8 \times 10^8$.
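The order-of-magnitude counts can be checked with a few lines of arithmetic. This sketch assumes $p = 50$ base variables (inferred from the factor 50 in the primitive count) and assumes the pair count is two operators times the number of unordered pairs of primitives; both are readings of the crude estimate rather than definitions confirmed by the source.

```python
from math import comb

p = 50  # number of base variables (assumption inferred from the "50 x 440" factor)
T = 20  # quantile thresholds per variable

per_variable = 2 * T + T**2   # crude per-variable count: 2T + T^2
n_prim = p * per_variable     # primitive rules
n_comp = 2 * comb(n_prim, 2)  # AND/OR over unordered pairs of primitives

print(per_variable)     # 440
print(n_prim)           # 22000
print(f"{n_comp:.2e}")  # 4.84e+08
```

The exact pair count is slightly below $(2.2 \times 10^4)^2$ because a rule is not paired with itself, which is immaterial at this level of approximation.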

[32]

cold start

Additional Variants (Temporal + Ratios): Introducing simple temporal summaries (e.g., $W = 4$ windows $\times$ 3 stats $= 12$ variants) and a restricted set of arithmetic ratios/differences over variable pairs ($\binom{50}{2} = 1225$) increases the candidate universe by large multiplicative factors. An order-of-magnitude approximation is $|R_{\text{univ}}| \approx |R_{\text{comp}}| \times (1 + \underbrace{12}_{\text{temporal}}) \times (1$ ...
