
arxiv: 2601.22324 · v2 · pith:S3ORUIZ5new · submitted 2026-01-29 · 💻 cs.LG · cs.MA

Automatic Construction of Clinical Scoring Systems with LLM Agents

Pith reviewed 2026-05-25 07:21 UTC · model grok-4.3

classification 💻 cs.LG cs.MA
keywords clinical scoring systems · LLM agents · interpretable models · clinical prediction · automated rule generation · unit-weighted checklists · guideline development

The pith

LLM agents can generate clinical scoring systems that outperform prior methods and match flexible models on eight tasks while exceeding established guidelines on external validation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AgentScore as a way to build simple clinical scoring systems by having LLMs propose candidate rules and then using a fixed verification loop to pick statistically sound ones that meet bedside constraints. Traditional machine learning often produces models too complex for routine use, while manual guideline scores are limited by the difficulty of searching all possible rule combinations. The method claims to close this gap by staying within unit-weighted checklists yet still delivering strong prediction. If the approach works, it would allow automatic creation of deployable scores that preserve interpretability without sacrificing accuracy.

Core claim

AgentScore performs semantically guided optimization in the space of unit-weighted clinical checklists by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.

What carries the argument

AgentScore, an LLM-driven proposal step followed by a deterministic verification-and-selection loop that filters rules for validity and deployability.
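The propose-then-verify pattern can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the candidate rules are hand-written stand-ins for LLM proposals, the toy patients are invented, and the greedy forward selection with a hypothetical `min_gain` threshold is a simple substitute for the paper's verification-and-selection loop, not a reproduction of it. The sketch also scores rules on a single toy dataset, whereas a real pipeline would select on data held out from the proposal step.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Patient = Dict[str, float]
Rule = Callable[[Patient], bool]


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def checklist_score(rules: Sequence[Rule], patient: Patient) -> int:
    """Unit-weighted score: one point per satisfied binary rule."""
    return sum(1 for rule in rules if rule(patient))


def select_rules(
    candidates: Dict[str, Rule],
    patients: Sequence[Patient],
    labels: Sequence[int],
    max_rules: int = 6,       # mirrors the six-rule cap quoted in the figure excerpts
    min_gain: float = 0.01,   # hypothetical admission threshold
) -> Tuple[List[str], float]:
    """Greedy forward selection (an illustrative stand-in, not the paper's loop)."""
    chosen: List[str] = []
    best = 0.5  # AUROC of an empty checklist, where every score is tied
    while len(chosen) < max_rules:
        trial_auc: Dict[str, float] = {}
        for name, rule in candidates.items():
            if name in chosen:
                continue
            trial = [candidates[c] for c in chosen] + [rule]
            scores = [checklist_score(trial, p) for p in patients]
            trial_auc[name] = auroc(scores, labels)
        if not trial_auc:
            break
        name = max(trial_auc, key=trial_auc.get)
        if trial_auc[name] - best < min_gain:
            break  # no remaining candidate improves discrimination enough
        chosen.append(name)
        best = trial_auc[name]
    return chosen, best


# Hand-written stand-ins for LLM-proposed rules (hypothetical).
candidates: Dict[str, Rule] = {
    "age>65": lambda p: p["age"] > 65,
    "lactate>2.0": lambda p: p["lactate"] > 2.0,
    "age>0": lambda p: p["age"] > 0,  # uninformative rule that should be rejected
}
# Invented toy patients and outcomes.
patients = [
    {"age": 70, "lactate": 3.1},
    {"age": 80, "lactate": 1.0},
    {"age": 50, "lactate": 2.5},
    {"age": 40, "lactate": 1.2},
    {"age": 30, "lactate": 0.9},
    {"age": 72, "lactate": 1.1},
]
labels = [1, 1, 1, 0, 0, 0]

chosen, best = select_rules(candidates, patients, labels)
print(chosen, round(best, 3))
```

The proposer only supplies candidates; which rules enter the checklist is decided entirely by a deterministic, data-driven check, so an uninformative proposal is discarded regardless of how plausible it sounds.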

If this is right

  • Clinical scoring systems can be produced automatically for many prediction tasks instead of requiring manual expert rule design.
  • Strong performance is possible even when scores are restricted to simple unit-weighted checklists of binary rules.
  • Automatically generated scores can exceed the discrimination of existing manual guidelines on externally validated tasks.
  • The structural limits of deployable guidelines need not prevent them from reaching accuracy levels close to less constrained interpretable models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proposal-and-verification pattern could be applied to create interpretable rule sets in non-clinical areas such as credit risk or safety checklists.
  • Even with the verification loop, proposed rules should be audited for unintended patterns that might reflect the LLM's pretraining data rather than the clinical dataset.
  • Future tests could measure whether the generated scores actually change clinician behavior or patient outcomes rather than only discrimination metrics.

Load-bearing premise

Rules proposed by the LLM and then filtered by the deterministic loop will produce scores whose performance holds up on new data without being distorted by the LLM's own training biases or selection artifacts.

What would settle it

Finding that AgentScore scores produce lower AUROC than established guideline scores when tested on a fresh external dataset drawn from a different hospital system or time period.

Figures

Figures reproduced from arXiv: 2601.22324 by Christopher Chiu, Mihaela van der Schaar, Silas Ruhrberg Estévez.

Figure 1
Figure 1: Clinical scoring systems: Guideline artifacts are compact, explicit checklists intended for reliable manual use. In parallel, increasingly complex machine learning models have achieved strong performance on clinical prediction tasks (Takita et al., 2025; Killock, 2020; Shickel et al., 2018). However, even when accurate and ostensibly interpretable, many such models remain poorly matched to guideline d…
Figure 2
Figure 2: Overview of AgentScore. An LLM-based proposal agent generates candidate rules from a dataset description and tool-mediated aggregate statistics; it never receives patient-level records. Proposed rules are screened by a deterministic validation module enforcing statistical performance and grammar-level deployability constraints (e.g., complexity limits, unit weights). Statistically admissible rules are revi…
Figure 3
Figure 3: External validation against guidelines. AgentScore outperforms clinical guidelines in AUROC (mean ± std over 5 seeds; guidelines deterministic).
5.3. Practical Deployability. We conducted a structured expert review of score deployability with a panel of N = 18 practicing clinicians (89% with ≥6 years of clinical experience) from six countries. Participants evaluated four representative AgentScore checkli…
Figure 4
Figure 4: Effect of LLM backbone choice. Performance of AgentScore across different language-model backbones. While GPT-5 attains the strongest results on average, GPT-4o and DeepSeek V3.2 exhibit comparable performance trends, suggesting limited sensitivity to the specific LLM used for rule proposal. The only apparent deviation from strict monotonicity occurs in the MIMIC Lung dataset, where the score bin of zero e…
Figure 5
Figure 5: Risk monotonicity across score values. Empirical outcome prevalence (percentage positive) as a function of the discrete AgentScore value.
E.4. Effect of guideline size. Throughout the paper, we restrict the maximum number of rules to $M_{\max} = 6$, reflecting a conservative upper bound for checklists that remain easily memorizable and executable at the bedside while achieving competitive predictive performance. …
Figure 6
Figure 6: Accuracy–complexity trade-off. AUROC as a function of the maximum number of rules allowed in the checklist. Performance improves monotonically with rule set size, illustrating a clear deployability–accuracy Pareto frontier.
Statistical analysis. We evaluate predictive performance using AUROC. For AgentScore, predicted probabilities are computed as $\hat{p} = \text{count} / n_{\text{rules}}$, where count denotes the number of satis…
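The scoring rule quoted in the Figure 6 excerpt can be written out together with the standard rank-based definition of AUROC. The symbols $s^{+}$ and $s^{-}$ (the scores of a randomly drawn positive and negative case) are introduced here for illustration and do not appear in the excerpt.

```latex
\hat{p} = \frac{\text{count}}{n_{\text{rules}}}, \qquad
\mathrm{AUROC} = \Pr\left(s^{+} > s^{-}\right) + \tfrac{1}{2}\,\Pr\left(s^{+} = s^{-}\right)
```

Because $\hat{p}$ is a strictly increasing function of the count for a fixed $n_{\text{rules}}$, the AUROC computed on $\hat{p}$ equals the AUROC computed on the raw count. The tie term matters for checklists, since a score with at most six rules takes only a handful of distinct values.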
Figure 7
Figure 7: Rule type diversity across cross-validation. Distribution of rule families (as defined by the rule grammar) appearing in the final AgentScore checklists across five folds per dataset. Bars report the frequency with which each rule type is used, showing that learned scores draw on multiple rule families rather than collapsing to a single template.
E.6. Clinician review of scoring system deployability. We con…
Figure 8
Figure 8: Clinician evaluation of scoring system deployability (N = 18). Top row: Experience distribution (Q1) and Likert-scale responses for trust (Q2–Q3) and deployability preferences (Q4–Q6). Bottom row: Aggregated pairwise preferences across 4 clinical tasks (Q7–Q9; 72 judgments each) and overall model preference (Q10). Model A: FasterRisk; Model B: AgentScore.
E.7. Wall-Clock Time. We additionally report wall-cl…
Original abstract

Modern clinical practice relies on evidence-based guidelines implemented as compact scoring systems composed of a small number of interpretable decision rules. While machine-learning models achieve strong performance, many fail to translate into routine clinical use due to misalignment with workflow constraints such as memorability, auditability, and bedside execution. We argue that this gap arises not from insufficient predictive power, but from optimizing over model classes that are incompatible with guideline deployment. Deployable guidelines often take the form of unit-weighted clinical checklists, formed by thresholding the sum of binary rules, but learning such scores requires searching an exponentially large discrete space of possible rule sets. We introduce AgentScore, which performs semantically guided optimization in this space by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces AgentScore, a method that uses LLM agents to propose candidate rules for clinical scoring systems and applies a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints such as unit weighting and interpretability. It claims that across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite stronger structural constraints. On two additional externally validated tasks, it reports higher discrimination than established guideline-based scores.

Significance. If the empirical results hold under rigorous scrutiny, this work could advance automated construction of deployable clinical guidelines by using LLMs for semantic search over an exponentially large discrete rule space while maintaining statistical controls. The combination of LLM proposal with deterministic verification addresses a key practical barrier in translating ML to bedside tools.

major comments (2)
  1. [Abstract] The central claim that AgentScore 'outperforms existing score-generation methods' across eight tasks and achieves 'higher discrimination than established guideline-based scores' on two external tasks is load-bearing, yet the abstract supplies no experimental details, baseline definitions, data splits, statistical tests, or multiple-comparison corrections. This prevents assessment of whether the reported gains are robust.
  2. [Methods, verification-and-selection loop] The approach relies on external data verification rather than reducing performance to quantities defined by fitted parameters internal to the model. Without explicit controls for LLM training biases or post-hoc selection effects from the proposal step, the generalization claims on external tasks rest on an assumption that requires concrete validation protocols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have made revisions to strengthen the manuscript's clarity on experimental details and validation protocols.

Point-by-point responses
  1. Referee: [Abstract] The central claim that AgentScore 'outperforms existing score-generation methods' across eight tasks and achieves 'higher discrimination than established guideline-based scores' on two external tasks is load-bearing, yet the abstract supplies no experimental details, baseline definitions, data splits, statistical tests, or multiple-comparison corrections. This prevents assessment of whether the reported gains are robust.

    Authors: We agree that the abstract would benefit from additional context to support the central claims. In the revised version, we have expanded the abstract to briefly specify the eight clinical tasks, the score-generation baselines compared, the use of internal cross-validation and external test sets, and the reporting of AUROC with statistical significance testing. Full details on data splits, baseline definitions, and multiple-comparison corrections (Bonferroni-adjusted) remain in the Methods and Results sections, with a new sentence in the abstract directing readers there. We believe this addresses the concern without exceeding abstract length limits. revision: yes

  2. Referee: [Methods, verification-and-selection loop] The approach relies on external data verification rather than reducing performance to quantities defined by fitted parameters internal to the model. Without explicit controls for LLM training biases or post-hoc selection effects from the proposal step, the generalization claims on external tasks rest on an assumption that requires concrete validation protocols.

    Authors: The verification-and-selection loop uses a three-way data split (proposal on training, selection on validation, evaluation on held-out test) to control post-hoc selection effects, with all final scores required to meet pre-specified statistical thresholds on the validation set before external testing. LLM proposal is treated as a semantic prior only; no LLM parameters are fitted to the clinical data. We have added an explicit subsection in Methods detailing the partitioning protocol, a sensitivity analysis removing the top-proposed rules, and a note that LLM training data overlap cannot be fully audited but is mitigated by the deterministic verification step. These additions provide the requested concrete validation protocols. revision: partial
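The partitioning described in the second response (proposal on training data, selection on validation data, evaluation on a held-out test set) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `three_way_split`, the 60/20/20 fractions, and the seed are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple


def three_way_split(
    ids: Sequence[int],
    frac_train: float = 0.6,  # hypothetical fraction used for rule proposal
    frac_val: float = 0.2,    # hypothetical fraction used for rule selection
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle patient ids once and cut them into three disjoint partitions."""
    if frac_train <= 0 or frac_val <= 0 or frac_train + frac_val >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_train = round(len(shuffled) * frac_train)
    n_val = round(len(shuffled) * frac_val)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # touched only for the final evaluation
    return train, val, test


train_ids, val_ids, test_ids = three_way_split(range(100))
print(len(train_ids), len(val_ids), len(test_ids))
```

Keeping the test partition out of both proposal and selection is what lets the final AUROC serve as an estimate that is not inflated by the search over candidate rules.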

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central method uses LLM agents to propose candidate rules followed by a deterministic, data-grounded verification-and-selection loop. Claims of outperformance are grounded in empirical AUROC evaluations on eight tasks plus external validation, not by construction from fitted parameters or self-referential definitions. No equations, ansatzes, or self-citations are shown that reduce results to inputs. The architecture separates proposal (LLM) from verification (external data), making the derivation self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, preventing identification of specific free parameters, axioms, or invented entities. The approach implicitly assumes LLM proposals are sufficiently diverse and that the verification loop enforces generalizability without introducing selection bias.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning Interpretable Point-Based Clinical Risk Scores via Direct Optimization

    stat.ME 2026-05 unverdicted novelty 5.0

    Develops greedy optimization algorithms for directly learning optimal integer-weighted clinical risk scores, applied to predict post-discharge mortality in a large EHR cohort with a supporting simulation study.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    doi: 10.21037/atm.2019.11

    ISSN 2305-5847. doi: 10.21037/atm.2019.11

  2. [2]

    2019.11.121

    URL http://dx.doi.org/10.21037/atm.2019.11.121. Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.,

  3. [3]

    Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl

    URL https://proceedings.neurips.cc/paper_files/paper/2018/file/69386f6bb1dfed68692a24c8686939b9-Paper.pdf. Collins, G. S., Reitsma, J. B., Altman, D. G., and Moons, K. G. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. Annals of Internal Medicine, 162(1):55–63, January 2015...

  4. [4]

    doi: 10.1016/s1088-467x(97)00008-5

    ISSN 1088-467X. doi: 10.1016/s1088-467x(97)00008-5. URL http://dx.doi.org/10.1016/S1088-467X(97)00008-5. Dawes, R. M. The robust beauty of improper linear models in decision making. American Psychologist, 34(7):571–582, July 1979. ISSN 0003-066X. doi: 10.1037/0003-066x.34.7.571. URL http://dx.doi.org/10.1037/0003-066X.34.7.571. Desai, N. and Gross, J. S...

  5. [5]

    doi: 10.1016/0002-9343(94)90143-0

    ISSN 0002-9343. doi: 10.1016/0002-9343(94)90143-0. URL http://dx.doi.org/10.1016/0002-9343(94)90143-0. D’Agostino, R. B., Vasan, R. S., Pencina, M. J., Wolf, P. A., Cobain, M., Massaro, J. M., and Kannel, W. B. General cardiovascular risk profile for use in primary care: The Framingham Heart Study. Circulation, 117(6):743–753, February 2008. ISSN 1524-4...

  6. [6]

    Granger, C

    ISBN 9780309164232. Granger, C. B. Predictors of hospital mortality in the Global Registry of Acute Coronary Events. Archives of Internal Medicine, 163(19):2345, October 2003. ISSN 0003-9926. doi: 10.1001/archinte.163.19.2345. URL http://dx.doi.org/10.1001/archinte.163.19.2345. Hager, P., Jungmann, F., Holland, R., Bhagat, K., Hubrecht, I., Knauer, M., Vi...

  7. [7]

    cc/paper_files/paper/2019/file/ac52c626afc10d4075708ac4c778ddfc-Paper

    URL https://proceedings.neurips.cc/paper_files/paper/2019/file/ac52c626afc10d4075708ac4c778ddfc-Paper.pdf. Johnson, A. E. W., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T. J., Hao, S., Moody, B., Gow, B., Lehman, L.-w. H., Celi, L. A., and Mark, R. G. MIMIC-IV, a freely accessible electronic health record dataset. Scientific...

  8. [8]

    URL http://dx

    doi: 10.1214/15-aoas848. URL http://dx.doi.org/10.1214/15-AOAS848. Lim, W. S., van der Eerden, M. M., Laing, R., Boersma, W. G., Karalus, N., Town, G. I., Lewis, S. A., and Macfarlane, J. T. Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study. Thorax, 58(5):377–382, May 2003. ISS...

  9. [9]

    Liu, T., Huynh, N., and van der Schaar, M

    URL https://openreview.net/forum?id=xTYL1J6Xt-z. Liu, T., Huynh, N., and van der Schaar, M. Decision tree induction through LLMs via semantically-aware evolution. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=UyhRtB4hjN. Lundberg, S. M. and Lee, S.-I. A unified approach to interpreti...

  10. [10]

    URL http://dx.doi.org/10.1136/bmj-2024-082505

    doi: 10.1136/bmj-2024-082505. URL http://dx.doi.org/10.1136/bmj-2024-082505. Nam, J., Kim, K., Oh, S., Tack, J., Kim, J., and Shin, J. Optimized feature generation for tabular data via LLMs with decision tree reasoning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,

  11. [11]

    Nori, H., Jenkins, S., Koch, P., and Caruana, R

    URL https://openreview.net/forum?id=APSBwuMopO. Nori, H., Jenkins, S., Koch, P., and Caruana, R. InterpretML: A unified framework for machine learning interpretability, 2019. URL https://arxiv.org/abs/1909.09223. Olesen, J., Torp-Pedersen, C., Hansen, M., and Lip, G. The value of the CHA2DS2-VASc score for refining stroke risk stratification in patien...

  12. [12]

    doi: 10.1016/s0001-2998(78)80013-0

    ISSN 0001-2998. doi: 10.1016/s0001-2998(78)80013-0. URL http://dx.doi.org/10.1016/s0001-2998(78)80013-0. Pollard, T. J., Johnson, A. E. W., Raffa, J. D., Celi, L. A., Mark, R. G., and Badawi, O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data, 5(1), September 2018. ISSN 2052...

  13. [13]

    URL http://dx.doi.org/10.1016/j.jcf.2019.03.002

    doi: 10.1016/j.jcf.2019.03.002. URL http://dx.doi.org/10.1016/j.jcf.2019.03.002. Ribeiro, M. T., Singh, S., and Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 1135–1144, New York, NY, USA, 2016. Asso...

  14. [14]

    doi: 10.1001/jama.1993

    ISSN 0098-7484. doi: 10.1001/jama.1993.03500090063034. URL http://dx.doi.org/10.1001/jama.1993.03500090063034. Shickel, B., Tighe, P. J., Bihorac, A., and Rashidi, P. Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE Journal of Biomedical and Health Informatics, 22(5):1589–1604, Septemb...

  15. [15]

    URL http://dx.doi.org/10.4103/0970-1591.91438

    doi: 10.4103/0970-1591.91438. URL http://dx.doi.org/10.4103/0970-1591.91438. 13 AgentScore: Autoformulation of Deployable Clinical Scoring Systems Takita, H., Kabata, D., Walston, S. L., Tatekawa, H., Saito, K., Tsujimoto, Y., Miki, Y., and Ueda, D. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and p...

  16. [16]

    doi: 10.1016/s0140-6736(74)91639-0

    ISSN 0140-6736. doi: 10.1016/s0140-6736(74)91639-0. URL http://dx.doi.org/10.1016/s0140-6736(74)91639-0. Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1):44–56, January 2019. ISSN 1546-170X. doi: 10.1038/s41591-018-0300-7. URL http://dx.doi.org/10.1038/s41591-018-0300-7. Ustun, B. and R...

  17. [17]

    doi: 10.1186/cc8204

    ISSN 1364-8535. doi: 10.1186/cc8204. URL http://dx.doi.org/10.1186/cc8204. Vincent, J. L., Moreno, R., Takala, J., Willatts, S., De Mendonça, A., Bruining, H., Reinhart, C. K., Suter, P. M., and Thijs, L. G. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure: On behalf of the working group on sepsis-relat...

  18. [18]

    doi: 10.1007/bf01709751

    ISSN 1432-1238. doi: 10.1007/bf01709751. URL http://dx.doi.org/10.1007/BF01709751. Wang, F. The crisis of biomedical foundation models. Journal of Biomedical Informatics, 171:104917, November 2025. ISSN 1532-0464. doi: 10.1016/j.jbi.2025.104917. URL http://dx.doi.org/10.1016/j.jbi.2025.104917. Wasylewicz, A. T. M. and Scheepers-Hoeks, A. M. J. W. Clin...

  19. [19]

    URL http://dx.doi.org/10.7861/clinmed.2022-0435

    doi: 10.7861/clinmed.2022-0435. URL http://dx.doi.org/10.7861/clinmed.2022-0435. Wells, P., Anderson, D., Rodger, M., Ginsberg, J., Kearon, C., Gent, M., Turpie, A., Bormanis, J., Weitz, J., Chamberlain, M., Bowie, D., Barnes, D., and Hirsh, J. Derivation of a simple clinical model to categorize patients probability of pulmonary embolism: Increasing ...

  20. [20]

    doi: 10.1016/s0140-6736(97)08140-3

    ISSN 0140-6736. doi: 10.1016/s0140-6736(97)08140-3. URL http://dx.doi.org/10.1016/S0140-6736(97)08140-3. Wieten, S. Expertise in evidence-based medicine: a tale of three models. Philosophy, Ethics, and Humanities in Medicine, 13(1), February 2018. ISSN 1747-5341. doi: 10.1186/s13010-018-0055-2. URL http://dx.doi.org/10.1186/s13010-018-0055-2. Williams, ...

  21. [21]

    normal vs abnormal

    demonstrate a distinct but equally important form of impact in emergency medicine, where a conservative binary checklist enables the safe exclusion of fracture and avoids unnecessary imaging, reducing cost and patient burden without increasing missed injuries. For the Ottawa Ankle Rules, the checklist is operationalized as a conservative OR-rule. ...

  22. [22]

    MIMIC-IV (v3.1) (Johnson et al., 2023): A de-identified electronic health record (EHR) database from Beth Israel Deaconess Medical Center containing over 400,000 hospital admissions

  23. [23]

    eICU Collaborative Research Database (v2.0) (Pollard et al., 2018): A multicenter critical care database comprising ICU stays from 208 hospitals across the United States.

  24. [24]

    UK Cystic Fibrosis (CF) Registry: A national registry containing annual longitudinal follow-up records for individuals with cystic fibrosis in the United Kingdom

  25. [25]

    Canadian Cystic Fibrosis Registry: A national population-based registry used for external validation of CF mortality prediction

  26. [26]

    Task definitions. Table 8 summarizes outcome definitions, prediction horizons, index times, and inclusion criteria

    PhysioNet Challenge 2012 (Silva et al., 2012): A publicly available ICU mortality benchmark comprising 8,000 patient episodes from two hospitals. Task definitions. Table 8 summarizes outcome definitions, prediction horizons, index times, and inclusion criteria. We provide additional clarifications below to ensure precise reproducibility. Observation windows...

  27. [27]

    Generation–Selection

    The “Generation–Selection” Gap: State-of-the-art interpretable solvers such as RiskSLIM and FasterRisk act as selectors, requiring a pre-computed feature matrix $X \in \{0,1\}^{N \times |R_{\text{univ}}|}$. They cannot generate semantic rules dynamically; any clinically meaningful derived constructs (ratios, trends, shallow logic) must be manually engineered and materialized as col...

  28. [28]

    In checklist learning, where both the score and the operating threshold are discrete, such rounding effects are often amplified

    Failure of Continuous Relaxation (e.g., Lasso): Relaxing $w \in \{0,1\}^{|R_{\text{univ}}|}$ to continuous weights $w \in \mathbb{R}^{|R_{\text{univ}}|}$ introduces a substantial integrality gap: continuous relaxations induce fractional solutions, and rounding them can change the induced decision boundary and utility relative to the discrete optimum. In checklist learning, where both the score and the ...

  29. [29]

    Age>65” and “Lactate<2.0

    Failure of Classical Heuristics (Genetic Algorithms, SA): Standard heuristic searches struggle with the semantic structure of the rule space:
      • Undefined Metric Space: Crossover operators in Genetic Algorithms require a meaningful metric space. It is unclear how to interpolate between “Age>65” and “Lactate<2.0”.
      • Sparse Fitness Landscape: A random mutation ...

  30. [30]

    Primitive Rules: With $T = 20$ quantile thresholds and range constraints, a crude count gives $|R_{\text{prim}}| \approx p \times (2T + T^2) \approx 50 \times 440 \approx 2.2 \times 10^4$.

  31. [31]

    Compositional Rules: Allowing depth-1 logical operators (AND/OR) between pairs of primitives yields, up to constants, $|R_{\text{comp}}| \approx 2 \cdot |R_{\text{prim}}|^2 \approx O(|R_{\text{prim}}|^2) \approx (2.2 \times 10^4)^2 \approx 4.8 \times 10^8$.

  32. [32]

    cold start

    Additional Variants (Temporal + Ratios): Introducing simple temporal summaries (e.g., $W = 4$ windows $\times$ 3 stats $= 12$ variants) and a restricted set of arithmetic ratios/differences over variable pairs ($\binom{50}{2} = 1225$) increases the candidate universe by large multiplicative factors. An order-of-magnitude approximation is $|R_{\text{univ}}| \approx |R_{\text{comp}}| \times (1 + \underbrace{12}_{\text{temporal}}) \times (1 \ldots$
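The order-of-magnitude counts quoted in entries 30 to 32 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are hypothetical, p = 50 is read off the quoted product 50 × 440, and the final size of the full candidate universe is left out because the source expression is truncated.

```python
from math import comb

p = 50  # number of variables, as implied by the quoted product 50 x 440
T = 20  # quantile thresholds per variable

per_variable = 2 * T + T**2    # crude per-variable count of threshold and range rules
n_prim = p * per_variable      # primitive rules, quoted as about 2.2e4
n_comp_order = n_prim**2       # order of the depth-1 compositions, quoted as about 4.8e8
n_pairs = comb(p, 2)           # variable pairs available for ratios and differences
temporal_variants = 4 * 3      # 4 windows x 3 summary statistics

print(per_variable, n_prim, n_comp_order, n_pairs, temporal_variants)
```

Even before temporal and ratio variants are counted, the compositional space is far too large to enumerate and score exhaustively, which is the motivation the excerpt gives for a guided proposal step.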