
arxiv: 2605.23465 · v1 · pith:7K6CYVMDnew · submitted 2026-05-22 · 💻 cs.CY

IyàwóBench: A Benchmark for Evaluating Large Language Model Clinical Triage Accuracy on Undifferentiated Febrile Illness in Nigerian Primary Health Settings

Pith reviewed 2026-05-25 02:57 UTC · model grok-4.3

classification 💻 cs.CY
keywords IyàwóBench · LLM clinical triage · undifferentiated febrile illness · Nigerian primary health care · WHO guidelines · safety score · triage accuracy · West African primary care

The pith

A new benchmark of 200 Nigerian primary care cases shows all tested LLMs triage febrile illness safely but vary sharply in accuracy, with guideline-embedded models leading by up to 28.5 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates IyàwóBench v1.0 from statistical distributions of 1,200 real patient visits at 19 Nigerian primary health centres to test how well large language models classify undifferentiated febrile illness into treat-on-site or refer-now decisions. Six models were scored on triage accuracy and on a safety metric that checks whether critical cases are ever wrongly downgraded. All six reached 100 percent safety. Among the four models that produced usable output, accuracy ranged from 39 percent for Llama 3.1 8B to 67.5 percent for Claude Sonnet, the top clinically engineered system; the other two scored near zero because they did not follow the required structured output format. The work supplies the first reproducible test set for LLM clinical decision support in West African primary care, where fever drives the largest share of outpatient visits.

Core claim

Modern LLMs exhibit safe triage behaviour on undifferentiated febrile illness but vary substantially in structured clinical accuracy. Clinically engineered systems with embedded WHO guidelines outperform general-purpose models by up to 28.5 percentage points. IyàwóBench provides the first reproducible evaluation framework for LLM clinical decision support in West African primary care.

What carries the argument

IyàwóBench v1.0, a dataset of 200 synthetic clinical vignettes across eight febrile illness categories derived from real PHC encounter distributions, scored on structured triage classification with separate accuracy and safety metrics.
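
The two metrics can be sketched as follows. This is a minimal illustration, not the paper's scoring code: the `Scored` record, the function names, and the rule that an unusable output counts as wrong but not as a downgrade are all assumptions made here. Only the two labels, REFER NOW and TREAT HERE, come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional

REFER_NOW = "REFER NOW"
TREAT_HERE = "TREAT HERE"


@dataclass
class Scored:
    gold: str                 # reference triage label for the vignette
    predicted: Optional[str]  # model label, or None if the output was unusable


def triage_accuracy(cases):
    """Share of all vignettes whose predicted label equals the reference label."""
    return sum(c.predicted == c.gold for c in cases) / len(cases)


def safety_score(cases):
    """Share of REFER NOW vignettes that were not downgraded to TREAT HERE."""
    critical = [c for c in cases if c.gold == REFER_NOW]
    downgraded = sum(c.predicted == TREAT_HERE for c in critical)
    return 1 - downgraded / len(critical)


# Four hypothetical scored vignettes (invented for illustration).
cases = [
    Scored(REFER_NOW, REFER_NOW),
    Scored(REFER_NOW, None),        # unusable output: wrong, but not a downgrade
    Scored(TREAT_HERE, REFER_NOW),  # over-referral: wrong, but safe
    Scored(TREAT_HERE, TREAT_HERE),
]
print(triage_accuracy(cases), safety_score(cases))
```

Under these assumed definitions the toy set scores 0.5 on accuracy and 1.0 on safety, which shows how a model can be wrong often, or even fail to answer, and still keep a perfect safety score.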

If this is right

  • LLMs can be used for initial triage without risking downgrades of critical REFER NOW cases.
  • Embedding WHO guidelines inside the model raises triage accuracy by as much as 28.5 points over general-purpose versions.
  • Two of the six models produced near-zero usable output due to failure to follow the required structured format.
  • The benchmark supplies a fixed, shareable test set that any new model can be run against for direct comparison.
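
To make the structured-format point concrete, here is a hedged sketch of how a strict parser and a lenient one can disagree on the same model output. The `TRIAGE:` line format, both function names, and the sample outputs are hypothetical; this page does not reproduce the prompts or parsing rules the paper used.

```python
import re
from typing import Optional

LABELS = ("REFER NOW", "TREAT HERE")
# Hypothetical required format: a line such as "TRIAGE: REFER NOW".
STRICT = re.compile(r"^TRIAGE:\s*(REFER NOW|TREAT HERE)\s*$", re.MULTILINE)


def parse_strict(output: str) -> Optional[str]:
    """Return the label only if the output follows the required line format."""
    match = STRICT.search(output)
    return match.group(1) if match else None


def parse_lenient(output: str) -> Optional[str]:
    """Return the label if exactly one label appears anywhere in the text."""
    upper = output.upper()
    found = [label for label in LABELS if label in upper]
    return found[0] if len(found) == 1 else None


free_text = "This child has danger signs, so I would refer now to a hospital."
compliant = "Reasoning: fever for two days, no danger signs.\nTRIAGE: TREAT HERE"

print(parse_strict(free_text))   # None: no "TRIAGE:" line is present
print(parse_lenient(free_text))  # REFER NOW
print(parse_strict(compliant))   # TREAT HERE
```

A model that answers sensibly in free text would score near zero under the strict parser and much higher under the lenient one, so a reported accuracy depends on which rule the harness applies.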

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model developers could use the benchmark to test whether adding explicit referral thresholds improves accuracy without harming the observed safety floor.
  • The same vignette-generation method could be applied to other high-volume conditions such as malaria or respiratory infections in similar settings.
  • Health systems might first pilot only the highest-accuracy models identified here rather than defaulting to the largest available general model.

Load-bearing premise

The synthetic vignettes generated from statistical distributions of 1,200 real patient encounters accurately represent the range of undifferentiated febrile illness presentations and triage decisions in Nigerian primary health settings.
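
Why the sampling scheme matters for this premise can be shown with a toy example. The records below are invented for illustration and are not drawn from the paper's data; the sketch only contrasts resampling whole records (joint) with sampling each feature from its own column (marginal).

```python
import random

# Toy encounter records (invented for illustration, not from the paper's data):
# (malaria_positive, severe_anemia). Anemia only occurs with malaria here.
records = [(True, True)] * 30 + [(True, False)] * 30 + [(False, False)] * 40


def sample_joint(rng, n):
    """Resample whole records, so co-occurrence between features is preserved."""
    return [rng.choice(records) for _ in range(n)]


def sample_marginal(rng, n):
    """Sample each feature from its own column, ignoring the other feature."""
    malaria = [r[0] for r in records]
    anemia = [r[1] for r in records]
    return [(rng.choice(malaria), rng.choice(anemia)) for _ in range(n)]


def unseen_share(samples):
    """Share of vignettes with anemia but no malaria (absent from the source)."""
    return sum((not m) and a for m, a in samples) / len(samples)


rng = random.Random(0)
joint = sample_joint(rng, 10_000)
marginal = sample_marginal(rng, 10_000)
print(unseen_share(joint))     # 0.0: only observed combinations can appear
print(unseen_share(marginal))  # about 0.4 * 0.3 = 0.12 in expectation
```

Independent sampling produces a feature combination that never occurs in the toy source in roughly one vignette out of eight. If the benchmark's vignettes were built this way, some cases could describe presentations that clinicians at the source facilities would not see.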

What would settle it

Applying the same six models to 200 real (non-synthetic) patient records from the original 19 PHCs and measuring agreement with expert clinician triage decisions on those exact cases.

Original abstract

Background. Undifferentiated febrile illness is the leading cause of primary care outpatient visits in Nigeria, yet no validated benchmark exists for evaluating large language model (LLM) clinical triage reasoning in West African primary health settings. Methods. We introduce IyàwóBench v1.0, a dataset of 200 synthetic clinical vignettes across eight febrile illness categories derived from statistical distributions of 1,200 real patient encounters at 19 primary health centres (PHCs) in Oyo State, Nigeria. Six LLMs were evaluated on structured triage classification across two metrics: triage accuracy and safety score. Results. All six models achieved 100% safety scores (95% CI: 96.4-100.0%), never downgrading a critical REFER NOW case to TREAT HERE. Triage accuracy varied substantially: Claude Sonnet (claude-sonnet-4-5) 67.5% (95% CI: 60.8-73.7%), Llama 4 Scout 59.5% (52.5-66.2%), Llama 3.3 70B 43.0% (36.2-50.0%), and Llama 3.1 8B 39.0% (32.4-45.9%). Two models demonstrated near-zero accuracy attributable to structured output non-compliance. Conclusions. Modern LLMs exhibit safe triage behaviour but vary substantially in structured clinical accuracy. Clinically engineered systems with embedded WHO guidelines outperform general-purpose models by up to 28.5 percentage points. IyàwóBench provides the first reproducible evaluation framework for LLM clinical decision support in West African primary care.
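
As a rough check on the width of the reported confidence intervals, the sketch below computes a Wilson score interval for a binomial proportion. The choice of the Wilson method is an assumption made here, since the abstract does not say which interval the authors used, and the function name is illustrative. The count of 135 is simply 67.5% of the 200 vignettes.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# 67.5% of 200 vignettes corresponds to 135 correct classifications.
low, high = wilson_interval(135, 200)
print(f"{low:.3f} to {high:.3f}")
```

With these inputs the interval comes out at about 60.7% to 73.6%, within roughly 0.1 percentage point of the reported 60.8-73.7% but not identical to it, so the paper may have used a different interval method or rounding. In either case, an interval about 13 points wide on 200 cases means the smaller gaps between adjacent models should be read with caution.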

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces IyàwóBench v1.0, a benchmark of 200 synthetic clinical vignettes derived from statistical distributions of 1,200 real patient encounters at 19 PHCs in Oyo State, Nigeria. It evaluates six LLMs on structured triage classification for undifferentiated febrile illness using triage accuracy and safety score metrics, reporting 100% safety across all models (never downgrading REFER NOW cases) but accuracies ranging from 39.0% (Llama 3.1 8B) to 67.5% (Claude Sonnet), with clinically engineered systems outperforming general-purpose models by up to 28.5 points and two models showing near-zero accuracy due to output non-compliance.

Significance. If the vignettes are representative, this establishes the first reproducible benchmark for LLM clinical triage reasoning in West African primary care settings. The universal safety result alongside variable accuracy provides actionable evidence on LLM limitations and the value of embedding WHO guidelines, supporting safer AI deployment in low-resource PHCs. The empirical design and framework reproducibility are strengths.

major comments (2)
  1. [Methods (vignette synthesis)] Methods section on vignette synthesis: The description states vignettes are 'derived from statistical distributions of 1,200 real patient encounters' but provides no details on whether features are sampled from marginal distributions independently or from joint distributions that preserve correlations (e.g., malaria-anemia co-occurrence, age-specific severity). This is load-bearing for the central claims, as the reported accuracies and 100% safety scores (Results) could be artifacts if real-world joint probabilities and contextual factors are omitted.
  2. [Results] Results (model evaluation): The attribution of near-zero accuracy in two models to 'structured output non-compliance' is noted, but the manuscript does not provide the exact evaluation prompts, output parsing rules, or compliance criteria used. This prevents verification of whether the accuracy gaps (e.g., 67.5% vs. 39.0%) reflect model capability or prompt sensitivity, weakening interpretation of the performance variation claim.
minor comments (1)
  1. [Abstract] Abstract: Model names are inconsistently formatted (e.g., 'claude-sonnet-4-5' vs. 'Llama 4 Scout'); standardize for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for identifying areas where additional methodological transparency would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Methods (vignette synthesis)] Methods section on vignette synthesis: The description states vignettes are 'derived from statistical distributions of 1,200 real patient encounters' but provides no details on whether features are sampled from marginal distributions independently or from joint distributions that preserve correlations (e.g., malaria-anemia co-occurrence, age-specific severity). This is load-bearing for the central claims, as the reported accuracies and 100% safety scores (Results) could be artifacts if real-world joint probabilities and contextual factors are omitted.

    Authors: We agree that the current Methods description is insufficiently detailed on this point. The manuscript states only that vignettes were 'derived from statistical distributions' without specifying marginal versus joint sampling or how correlations were preserved. We will revise the Methods section (and add an appendix if needed) to explicitly describe the synthesis procedure, including the distributions used and any steps taken to maintain feature correlations observed in the source data. revision: yes

  2. Referee: [Results] Results (model evaluation): The attribution of near-zero accuracy in two models to 'structured output non-compliance' is noted, but the manuscript does not provide the exact evaluation prompts, output parsing rules, or compliance criteria used. This prevents verification of whether the accuracy gaps (e.g., 67.5% vs. 39.0%) reflect model capability or prompt sensitivity, weakening interpretation of the performance variation claim.

    Authors: We concur that the absence of the exact prompts, parsing rules, and compliance criteria limits independent verification of the results. We will add the full evaluation prompts, output parsing logic, and compliance definitions to the revised manuscript (as supplementary material) so that readers can assess whether observed differences arise from model behavior or implementation choices. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with direct measurements

Full rationale

The paper introduces a benchmark dataset of synthetic vignettes derived from real patient encounter statistics and reports direct LLM performance metrics (accuracy, safety scores) on that fixed dataset. No derivations, fitted parameters, predictions, or self-citations are used to generate the central results; the reported percentages are straightforward empirical measurements on the 200 vignettes. The construction of the vignettes from statistical distributions of real encounters (the paper does not say whether these are marginal or joint) is a methodological choice whose validity can be assessed externally against real data, but it does not create a self-referential loop within the paper's claims. This matches the default expectation for non-circular empirical evaluation studies.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims depend on the fidelity of the synthetic data to real cases and the validity of the triage metrics, which are domain assumptions drawn from clinical practice rather than proven within the paper.

free parameters (1)
  • Statistical parameters for vignette synthesis
    Derived from distributions in 1,200 real encounters to generate the 200 vignettes across eight categories
axioms (2)
  • domain assumption: The eight febrile illness categories cover the relevant presentations of undifferentiated febrile illness
    Basis for structuring the dataset and classification task
  • domain assumption: Safety score is appropriately defined as avoiding downgrade of REFER NOW cases
    Underpins the 100% safety result reported


