arxiv: 2503.17599 · v3 · pith:XLP6RMQVnew · submitted 2025-03-22 · 💻 cs.CL · cs.AI

Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark

Pith reviewed 2026-05-22 23:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · general practice · clinical benchmark · LLM evaluation · medical AI · competency assessment · primary care

The pith

A new expert-annotated benchmark indicates that large language models are not suitable for autonomous deployment in general practice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates GPBench to test LLMs against the actual daily responsibilities of general practitioners rather than simplified exam questions. Ten state-of-the-art models were evaluated using data annotated by domain experts to match routine clinical standards. The results show consistent gaps in performance that prevent independent operation. This finding indicates that realistic clinical use of these models requires ongoing human oversight and that further model optimization focused on GP tasks is still needed.

Core claim

The paper concludes that current LLMs are not suitable for autonomous deployment in clinical general practice. All realistic applications require continuous human oversight, and further optimization specifically tailored to the daily responsibilities of GPs remains essential. The conclusion rests on evaluations conducted with GPBench, a benchmark whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards.

What carries the argument

GPBench, a general practice benchmark whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. It serves as the evaluation framework for measuring LLM competencies against GP duties.

If this is right

  • Current LLMs cannot reliably fulfill the duties of general practitioners.
  • All realistic applications of LLMs in general practice require continuous human oversight.
  • Further optimization specifically tailored to the daily responsibilities of GPs remains essential.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • GPBench scores could serve as a baseline for measuring progress in future LLM versions aimed at primary care.
  • Similar expert-annotated benchmarks may prove useful for evaluating LLMs in other medical specialties.
  • Practical deployment strategies could combine LLMs with structured human review processes calibrated to GPBench performance levels.

Load-bearing premise

The GPBench framework, constructed from expert annotations aligned with routine clinical practice standards, accurately measures the competencies needed for autonomous general practitioner duties.

What would settle it

A demonstration that an LLM reaches human GP performance levels on GPBench tasks during real clinical deployment with no ongoing supervision would challenge the central claim.

Original abstract

Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not suitable for autonomous deployment in clinical general practice and that all realistic applications require continuous human oversight; further optimization specifically tailored to the daily responsibilities of GPs remains essential.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 1 minor

Summary. The paper introduces GPBench, a competency-based evaluation framework and benchmark for LLMs in general practice, constructed from expert-annotated cases aligned with routine clinical practice standards. It evaluates ten state-of-the-art LLMs on this benchmark and concludes that current LLMs are not suitable for autonomous deployment in clinical general practice, that all realistic applications require continuous human oversight, and that further optimization tailored to GP daily responsibilities remains essential.

Significance. The introduction of a competency-based framework that moves beyond exam-style or simplified QA formats to align with real-world GP responsibilities is a clear strength and addresses a documented gap in existing medical AI benchmarks. If the GPBench items are shown to be reliable and if performance on them correlates with clinical safety outcomes, the results would provide actionable evidence on the current limitations of LLMs for high-stakes autonomous use and reinforce the importance of human-in-the-loop designs in clinical AI.

major comments (4)
  1. [Abstract] The central claim that LLMs are unsuitable for autonomous deployment is asserted without any reported model scores, inter-annotator agreement, benchmark validation steps, or error analysis, preventing assessment of whether the evidence supports the conclusion.
  2. [GPBench construction] The expert annotation process aligned with routine clinical standards is described, but no inter-annotator reliability statistics are supplied, leaving the consistency and validity of the benchmark data unverified.
  3. [Evaluation results] No licensed-GP baseline scores on the identical GPBench items are reported, so it is impossible to distinguish whether low LLM performance reflects model inadequacy or benchmark difficulty.
  4. [Discussion] The inference that continuous human oversight is required for all realistic applications rests on benchmark scores alone, without any mapping from observed errors to documented adverse events or safety data from primary care.
minor comments (1)
  1. [Abstract] The abstract would benefit from inclusion of the key quantitative LLM performance figures to allow readers to gauge the magnitude of the reported gaps.

Simulated Authors' Rebuttal

4 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback, which highlights important areas for strengthening the manuscript's claims and transparency. We address each major comment point-by-point below, proposing revisions where the points identify verifiable gaps in the current version.

Point-by-point responses
  1. Referee: [Abstract] The central claim that LLMs are unsuitable for autonomous deployment is asserted without any reported model scores, inter-annotator agreement, benchmark validation steps, or error analysis, preventing assessment of whether the evidence supports the conclusion.

    Authors: We agree that the abstract would benefit from greater specificity to allow readers to evaluate the central claim. In the revised version, we will expand the abstract to include summary performance metrics across the ten LLMs (e.g., overall competency scores and key failure rates), a brief reference to the expert annotation process and inter-annotator agreement, and mention of the error analysis presented in the results section. This will make the evidence supporting the conclusion more immediately assessable while remaining within abstract length constraints.
    revision: yes

  2. Referee: [GPBench construction] The expert annotation process aligned with routine clinical standards is described, but no inter-annotator reliability statistics are supplied, leaving the consistency and validity of the benchmark data unverified.

    Authors: This is a valid observation. While the annotation followed a multi-expert process with standardized clinical guidelines, inter-annotator agreement statistics were not reported in the original submission. We will add these metrics (e.g., Cohen's kappa or percentage agreement) to the methods section of the revised manuscript, calculated on a subset of overlapping annotations, to provide quantitative evidence of benchmark reliability.
    revision: yes

  3. Referee: [Evaluation results] No licensed-GP baseline scores on the identical GPBench items are reported, so it is impossible to distinguish whether low LLM performance reflects model inadequacy or benchmark difficulty.

    Authors: We acknowledge the value of a direct human baseline for calibration. The GPBench items were constructed from routine clinical practice standards that licensed GPs are expected to meet, providing an implicit reference point. However, administering the full benchmark to a cohort of licensed GPs would require substantial additional resources, recruitment, and ethics approvals not feasible within the current study timeline. In the revision, we will explicitly discuss this limitation, provide qualitative context on expected GP performance based on the competency framework, and recommend human baseline collection as important future work.
    revision: partial

  4. Referee: [Discussion] The inference that continuous human oversight is required for all realistic applications rests on benchmark scores alone, without any mapping from observed errors to documented adverse events or safety data from primary care.

    Authors: We agree that stronger linkage to real-world safety outcomes would reinforce the practical implications. The current inference draws from the benchmark's alignment with documented GP competencies, where failures in areas such as diagnosis, management, or communication carry inherent clinical risks. We will revise the discussion to more explicitly acknowledge the absence of direct adverse-event mapping, clarify that the recommendation for human oversight is based on competency gaps rather than proven harm, and highlight the need for future studies correlating GPBench performance with primary-care safety data.
    revision: yes
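
The agreement statistics named in response 2 (Cohen's kappa and percentage agreement) can be illustrated with a short sketch. This is not the authors' procedure: the function names and the toy labels below are hypothetical, and it assumes exactly two annotators assigning one categorical label per item.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("expected two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    observed = percent_agreement(labels_a, labels_b)
    n = len(labels_a)
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label
    # if each labels independently according to their own label frequencies.
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout, so kappa is
        # undefined (0/0). Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data (not from GPBench): two annotators judging whether
# an indicator applies to each of ten cases.
annotator_1 = ["yes"] * 5 + ["no"] * 5
annotator_2 = ["yes"] * 4 + ["no"] * 5 + ["yes"]

print(percent_agreement(annotator_1, annotator_2))  # 0.8
print(cohens_kappa(annotator_1, annotator_2))       # approximately 0.6
```

In the toy data the annotators agree on 8 of 10 items, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate. This gap is why chance-corrected statistics are usually reported alongside percentage agreement.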

Circularity Check

0 steps flagged

No circularity: new expert-annotated benchmark with no equations, fits, or self-referential derivations

Full rationale

The paper constructs GPBench from fresh expert annotations aligned to routine clinical standards and evaluates LLMs on it. No equations, parameters, or predictions are fitted to subsets of the data; the central claim (LLMs unsuitable for autonomous use) follows directly from observed scores on the new benchmark rather than reducing to any self-definition, self-citation chain, or renamed known result. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation framework depends on the assumption that expert annotations faithfully represent real GP competencies; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Domain expert annotations accurately reflect routine clinical practice standards for general practitioners.
    Benchmark data are meticulously annotated by domain experts in accordance with routine clinical practice standards.

pith-pipeline@v0.9.0 · 5749 in / 925 out tokens · 39126 ms · 2026-05-22T23:38:03.903457+00:00 · methodology

