IyàwóBench: A Benchmark for Evaluating Large Language Model Clinical Triage Accuracy on Undifferentiated Febrile Illness in Nigerian Primary Health Settings
Pith reviewed 2026-05-25 02:57 UTC · model grok-4.3
The pith
A new benchmark of 200 Nigerian primary care cases shows that all tested LLMs triage febrile illness safely but vary sharply in accuracy, with guideline-embedded models leading by up to 28.5 percentage points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modern LLMs exhibit safe triage behaviour on undifferentiated febrile illness but vary substantially in structured clinical accuracy. Clinically engineered systems with embedded WHO guidelines outperform general-purpose models by up to 28.5 percentage points. IyàwóBench provides the first reproducible evaluation framework for LLM clinical decision support in West African primary care.
What carries the argument
IyàwóBench v1.0, a dataset of 200 synthetic clinical vignettes across eight febrile illness categories derived from real PHC encounter distributions, scored on structured triage classification with separate accuracy and safety metrics.
If this is right
- LLMs can be used for initial triage without risking downgrades of critical REFER NOW cases.
- Embedding WHO guidelines inside the model raises triage accuracy by as much as 28.5 percentage points over general-purpose versions.
- Two of the six models produced near-zero usable output due to failure to follow the required structured format.
- The benchmark supplies a fixed, shareable test set that any new model can be run against for direct comparison.
Where Pith is reading between the lines
- Model developers could use the benchmark to test whether adding explicit referral thresholds improves accuracy without harming the observed safety floor.
- The same vignette-generation method could be applied to other high-volume conditions such as malaria or respiratory infections in similar settings.
- Health systems might first pilot only the highest-accuracy models identified here rather than defaulting to the largest available general model.
Load-bearing premise
The synthetic vignettes generated from statistical distributions of 1,200 real patient encounters accurately represent the range of undifferentiated febrile illness presentations and triage decisions in Nigerian primary health settings.
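Why the sampling scheme matters for this premise can be illustrated with a minimal sketch. It contrasts drawing whole records (joint sampling) with drawing each feature independently (marginal sampling). The toy records, feature names, and function names below are hypothetical illustrations and are not taken from the paper, which does not describe its synthesis procedure at this level.

```python
import random

# Hypothetical toy encounters as (malaria_positive, anemia) pairs.
# These counts are invented for illustration; they are not data from the paper.
ENCOUNTERS = (
    [(True, True)] * 30
    + [(True, False)] * 20
    + [(False, True)] * 5
    + [(False, False)] * 45
)


def sample_joint(records, n, rng):
    """Draw whole records, so co-occurrence between features is preserved."""
    return [rng.choice(records) for _ in range(n)]


def sample_marginal(records, n, rng):
    """Draw each feature independently from its own column, losing correlation."""
    columns = list(zip(*records))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]


def cooccurrence_rate(samples):
    """Fraction of malaria-positive samples that also show anemia."""
    positives = [s for s in samples if s[0]]
    return sum(1 for s in positives if s[1]) / len(positives)
```

In the toy data, 30 of the 50 malaria-positive records also show anemia, so joint sampling should keep the conditional rate near 0.60, while independent marginal sampling should pull it toward the overall anemia rate of 0.35. A benchmark built the second way could under-represent the correlated presentations that drive referral decisions.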
What would settle it
Applying the same six models to 200 real (non-synthetic) patient records from the original 19 PHCs and measuring agreement with expert clinician triage decisions on those exact cases.
Original abstract
Background. Undifferentiated febrile illness is the leading cause of primary care outpatient visits in Nigeria, yet no validated benchmark exists for evaluating large language model (LLM) clinical triage reasoning in West African primary health settings. Methods. We introduce IyàwóBench v1.0, a dataset of 200 synthetic clinical vignettes across eight febrile illness categories derived from statistical distributions of 1,200 real patient encounters at 19 primary health centres (PHCs) in Oyo State, Nigeria. Six LLMs were evaluated on structured triage classification across two metrics: triage accuracy and safety score. Results. All six models achieved 100% safety scores (95% CI: 96.4-100.0%), never downgrading a critical REFER NOW case to TREAT HERE. Triage accuracy varied substantially: Claude Sonnet (claude-sonnet-4-5) 67.5% (95% CI: 60.8-73.7%), Llama 4 Scout 59.5% (52.5-66.2%), Llama 3.3 70B 43.0% (36.2-50.0%), and Llama 3.1 8B 39.0% (32.4-45.9%). Two models demonstrated near-zero accuracy attributable to structured output non-compliance. Conclusions. Modern LLMs exhibit safe triage behaviour but vary substantially in structured clinical accuracy. Clinically engineered systems with embedded WHO guidelines outperform general-purpose models by up to 28.5 percentage points. IyàwóBench provides the first reproducible evaluation framework for LLM clinical decision support in West African primary care.
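The confidence intervals quoted in the abstract can be approximated with a standard binomial interval. The sketch below uses the Wilson score interval; that choice is an assumption, since the abstract does not name the method, and the function name is hypothetical. Results may differ from the published bounds in the last digit depending on the exact method and rounding.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # 67.5% accuracy on 200 vignettes corresponds to 135 correct classifications.
    low, high = wilson_interval(135, 200)
    print(f"{low:.3f} to {high:.3f}")
```

Unlike the simple normal approximation, the Wilson interval stays informative when the observed proportion is 100%: the lower bound reduces to n / (n + z²), so a perfect safety score still carries a lower bound that depends on how many critical cases were tested.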
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces IyàwóBench v1.0, a benchmark of 200 synthetic clinical vignettes derived from statistical distributions of 1,200 real patient encounters at 19 PHCs in Oyo State, Nigeria. It evaluates six LLMs on structured triage classification for undifferentiated febrile illness using triage accuracy and safety score metrics, reporting 100% safety across all models (never downgrading REFER NOW cases) but accuracies ranging from 39.0% (Llama 3.1 8B) to 67.5% (Claude Sonnet), with clinically engineered systems outperforming general-purpose models by up to 28.5 percentage points and two models showing near-zero accuracy due to output non-compliance.
Significance. If the vignettes are representative, this establishes the first reproducible benchmark for LLM clinical triage reasoning in West African primary care settings. The universal safety result alongside variable accuracy provides actionable evidence on LLM limitations and the value of embedding WHO guidelines, supporting safer AI deployment in low-resource PHCs. The empirical design and framework reproducibility are strengths.
major comments (2)
- [Methods (vignette synthesis)] Methods section on vignette synthesis: The description states vignettes are 'derived from statistical distributions of 1,200 real patient encounters' but provides no details on whether features are sampled from marginal distributions independently or from joint distributions that preserve correlations (e.g., malaria-anemia co-occurrence, age-specific severity). This is load-bearing for the central claims, as the reported accuracies and 100% safety scores (Results) could be artifacts if real-world joint probabilities and contextual factors are omitted.
- [Results] Results (model evaluation): The attribution of near-zero accuracy in two models to 'structured output non-compliance' is noted, but the manuscript does not provide the exact evaluation prompts, output parsing rules, or compliance criteria used. This prevents verification of whether the accuracy gaps (e.g., 67.5% vs. 39.0%) reflect model capability or prompt sensitivity, weakening interpretation of the performance variation claim.
minor comments (1)
- [Abstract] Abstract: Model names are inconsistently formatted (e.g., 'claude-sonnet-4-5' vs. 'Llama 4 Scout'); standardize for clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for identifying areas where additional methodological transparency would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
point-by-point responses
- Referee: [Methods (vignette synthesis)] Methods section on vignette synthesis: The description states vignettes are 'derived from statistical distributions of 1,200 real patient encounters' but provides no details on whether features are sampled from marginal distributions independently or from joint distributions that preserve correlations (e.g., malaria-anemia co-occurrence, age-specific severity). This is load-bearing for the central claims, as the reported accuracies and 100% safety scores (Results) could be artifacts if real-world joint probabilities and contextual factors are omitted.
  Authors: We agree that the current Methods description is insufficiently detailed on this point. The manuscript states only that vignettes were 'derived from statistical distributions' without specifying marginal versus joint sampling or how correlations were preserved. We will revise the Methods section (and add an appendix if needed) to explicitly describe the synthesis procedure, including the distributions used and any steps taken to maintain feature correlations observed in the source data.
  Revision: yes
- Referee: [Results] Results (model evaluation): The attribution of near-zero accuracy in two models to 'structured output non-compliance' is noted, but the manuscript does not provide the exact evaluation prompts, output parsing rules, or compliance criteria used. This prevents verification of whether the accuracy gaps (e.g., 67.5% vs. 39.0%) reflect model capability or prompt sensitivity, weakening interpretation of the performance variation claim.
  Authors: We concur that the absence of the exact prompts, parsing rules, and compliance criteria limits independent verification of the results. We will add the full evaluation prompts, output parsing logic, and compliance definitions to the revised manuscript (as supplementary material) so that readers can assess whether observed differences arise from model behavior or implementation choices.
  Revision: yes
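To make concrete what "output parsing rules" and "compliance criteria" might look like, here is a minimal sketch of a strict parser. It is not the paper's evaluation code, which is not provided. Only the REFER NOW and TREAT HERE labels appear in the abstract; the rule that exactly one label must appear, and all function names, are assumptions made for illustration.

```python
import re

# Only these two labels are named in the abstract; any further triage
# categories used by the benchmark are unknown here and are not modelled.
LABELS = ("REFER NOW", "TREAT HERE")
_LABEL_RE = re.compile("|".join(re.escape(label) for label in LABELS))


def parse_triage(output):
    """Return the single triage label in a model output, or None if non-compliant.

    Assumed compliance rule: exactly one distinct label must appear.
    Matching ignores case and collapses runs of whitespace.
    """
    normalised = " ".join(output.upper().split())
    found = set(_LABEL_RE.findall(normalised))
    return found.pop() if len(found) == 1 else None


def compliance_rate(outputs):
    """Fraction of outputs that yield a usable label under the rule above."""
    parsed = [parse_triage(o) for o in outputs]
    return sum(p is not None for p in parsed) / len(parsed)
```

Under a rule like this, an output that hedges between both labels or never states one is scored as unusable, regardless of how sound the surrounding reasoning is. That is why publishing the exact rule matters: a looser parser could turn some of the near-zero accuracies into ordinary ones.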
Circularity Check
No circularity: empirical benchmark evaluation with direct measurements
full rationale
The paper introduces a benchmark dataset of synthetic vignettes derived from real patient encounter statistics and reports direct LLM performance metrics (accuracy, safety scores) on that fixed dataset. No derivations, fitted parameters, predictions, or self-citations are used to generate the central results; the reported percentages are straightforward empirical measurements on the 200 vignettes. The construction of the vignettes from statistical distributions of real encounters is a methodological choice whose validity can be assessed externally against real data, but it does not create a self-referential loop within the paper's claims. This matches the default expectation for non-circular empirical evaluation studies.
Axiom & Free-Parameter Ledger
free parameters (1)
- Statistical parameters for vignette synthesis
axioms (2)
- domain assumption: The eight febrile illness categories cover the relevant presentations of undifferentiated febrile illness
- domain assumption: Safety score is appropriately defined as avoiding downgrade of REFER NOW cases
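The second assumption can be stated precisely with a short sketch of the two metrics. This is one plausible reading of the abstract's definitions, not the paper's scoring code; the function names and the treatment of unparseable outputs (represented as None) are assumptions.

```python
def triage_accuracy(gold, pred):
    """Fraction of vignettes where the predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def safety_score(gold, pred, critical="REFER NOW", unsafe="TREAT HERE"):
    """Fraction of critical cases that were not downgraded to the unsafe label.

    Assumption: a missing or unparseable prediction (None) is not counted
    as a downgrade, so it does not lower the safety score.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be the same length")
    critical_preds = [p for g, p in zip(gold, pred) if g == critical]
    if not critical_preds:
        raise ValueError("no critical cases to score")
    return sum(p != unsafe for p in critical_preds) / len(critical_preds)
```

Under this reading, a model whose outputs mostly fail to parse would score near zero on accuracy while still scoring 100% on safety, because it never explicitly says TREAT HERE for a critical case. That would be consistent with all six models reaching 100% safety despite two having near-zero accuracy, and it shows why the safety definition is worth treating as an assumption rather than a given.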