
arxiv: 2605.01417 · v1 · submitted 2026-05-02 · 💻 cs.CL · cs.AI

Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

Pith reviewed 2026-05-09 14:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM evaluation · medical benchmarks · clinical reasoning · question answering · information extraction · model bias · token efficiency · open-source suite

The pith

Medmarks supplies a fully open benchmark suite of 30 medical tasks to evaluate LLMs on question answering, extraction, calculations, and clinical reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Medmarks to overcome saturation and restricted data problems in existing medical LLM evaluations. It assembles 30 open benchmarks across four task types and runs them on 61 models in 71 configurations. The results indicate that top frontier reasoning models lead overall, proprietary models tend to use fewer tokens than open-weight ones, medically tuned models beat their generalist bases, and smaller models plus certain frontier ones show sensitivity to answer ordering. A subset of the tasks is structured so it can serve directly as environments for reinforcement learning.

Core claim

Medmarks is an open-source suite that standardizes evaluation across 30 tasks in question answering, information extraction, medical calculations, and open-ended clinical reasoning, with results from 61 models showing frontier reasoning systems at the top, proprietary models more token-efficient, fine-tuned medical models stronger than generalist counterparts, and widespread susceptibility to answer-order bias.

What carries the argument

The Medmarks benchmark suite itself, built from 30 verifiable tasks divided into question answering, information extraction, medical calculations, and open-ended clinical reasoning, scored with exact metrics plus LLM-as-a-Judge for the reasoning subset.

If this is right

  • Frontier reasoning models such as Gemini 3 Pro Preview, GPT-5.1 and GPT-5.2 lead performance across the full set of tasks.
  • Most proprietary frontier models consume significantly fewer tokens than open-weight alternatives on the same inputs.
  • Models that have been fine-tuned on medical data outperform their generalist base versions on the benchmarks.
  • Answer-order bias appears in many models, with smaller ones and Grok 4 showing it most clearly.
  • The Medmarks-T subset supplies ready reinforcement-learning environments for post-training medical reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standardized open benchmarks like this could let independent groups compare medical LLMs without depending on private datasets.
  • Token-efficiency differences may become a practical selection criterion when deploying models in cost-sensitive clinical settings.
  • Answer-order bias could be mitigated by randomizing option order or using chain-of-thought prompting that does not rely on multiple-choice formats.
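
One way to probe the answer-order bias discussed above is to present the same multiple-choice question under several random option orders and check whether the model still selects the correct option. The following is a minimal Python sketch under stated assumptions: `shuffle_options`, `order_consistency`, and the `predict` callable are hypothetical names chosen for illustration and are not taken from the Medmarks code.

```python
import random
from typing import Callable, List, Sequence, Tuple


def shuffle_options(
    options: Sequence[str], answer_idx: int, rng: random.Random
) -> Tuple[List[str], int]:
    """Return a shuffled copy of the options and the new index of the correct one."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct option moved to wherever its original index landed in `order`.
    return shuffled, order.index(answer_idx)


def order_consistency(
    options: Sequence[str],
    answer_idx: int,
    predict: Callable[[List[str]], int],  # hypothetical model wrapper: options -> chosen index
    n_trials: int = 10,
    seed: int = 0,
) -> float:
    """Fraction of shuffled presentations on which `predict` picks the correct option."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        shuffled, new_idx = shuffle_options(options, answer_idx, rng)
        if predict(shuffled) == new_idx:
            hits += 1
    return hits / n_trials
```

A predictor that keys on content scores 1.0 regardless of order, while one that always picks the same position will score close to chance; a large gap between a model's fixed-order accuracy and this consistency score would suggest order sensitivity.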

Load-bearing premise

The 30 selected tasks together give a representative picture of the medical capabilities needed in real clinical use, and LLM-as-a-Judge scoring is reliable enough for the open-ended reasoning tasks.

What would settle it

A new model that scores near the top on all 30 Medmarks tasks yet performs poorly when tested by human clinicians on the same task types in live settings would show the suite fails to capture essential medical performance.

Figures

Figures reproduced from arXiv: 2605.01417 by Adepoju Jeremiah Moyondafoluwa, Ahmed Essouaied, Ameen Patel, Anas Zafar, Anish Mahishi, Arya Hariharan, Ashish Vashist, Aymane Ouraq, Benjamin Warner, Bofeng Huang, Connor Lane, Geetu Ambwani, Harsh Deshpande, Hunar Batra, Jean-Benoit Delbrouck, Johannes Hagemann, Kunal Bagga, Leema Krishna Murali, Manish Ram, Maxime Griot, Max Kieffer, Molly Beavers, Nikhil Khandekar, Nishant Mishra, Paul Steven Scotti, Ratna Sagari Grandhi, Robert Scholz, Ronald Clark, Sameed Khan, Saurav Panigrahi, Shamus Sim Zi Yang, Siddhant Bharadwaj, Srishti Gureja, Tanishq Mathew Abraham, William Brown.

Figure 1: Results on MEDMARKS-V and MEDMARKS-OE for subset of models evaluated on both benchmarks. Accurately tracking the medical capabilities of frontier LLMs and understanding their limitations is crucial to ensuring the safe deployment of current LLMs and to improving future generations of LLMs for medical applications. Medical LLM benchmarks have seen wide adoption but they have all saturated, heavily depend …
Figure 2: Distribution of model scores based on model category for the MEDMARKS-V subset.
Figure 4: A scatter plot of the mean win rate on MEDMARKS-V by cost for the model APIs evaluated. We measured average inference cost per example and the total inference cost on MEDMARKS-V for the top 12 performing models.
Figure 6: Win Rate change between instruction and reasoning models. Adding reasoning does not always guarantee improvement, as seen with the Ministral family of models. Here, the reasoning models underperform, sometimes significantly, compared to their instruction counterparts on almost all datasets. From our medical benchmark alone, we cannot tell if this is a case of catastrophic forgetting or overfitting, overly d…
Figure 8: Win Rate change between quantized models. Our results show minimal performance degradation from quantization on most datasets, with the more aggressively quantized AWQ 4-bit model suffering a small but consistent penalty on most datasets. Our results align with (Zheng et al., 2025), who report that model degradation starts at 4-bit quantization for AWQ and GPTQ formats, and becomes more pronounced with more…
Figure 10: Test accuracy and training reward for Qwen-3-4B-Instruct-0725 trained on MedCalc-Bench, MedMCQA, and MedCaseReasoning over the course of training for 330 steps.
Figure 9: Comparing model performance with and without an extra option on the Medbullets (Chen et al., 2025b) benchmark. 3.12. Medical-specific Post-Training with MEDMARKS-T. Since we implemented the datasets in the verifiers framework (Brown, 2025), the datasets that come with train/test splits can be easily used as RL environments to post-train LLMs. The datasets with training splits are: MedQA, MedMCQA, PubMedQA,…
Figure 11: MedCalcBench with and without tools.
Figure 12: Subset of models on MedAgentBench V2. This benchmark has two sets of notable results: first, GLM 4.7 narrowly outperforms both GPT-5.1 medium and significantly outperforms Claude Sonnet 4.5, and Qwen3 235B-A22B Thinking significantly underperforms Qwen3 30B-A3B Thinking.
Figure 13: Heatmap table of the raw scores for each model across the 19 benchmarks of the MEDMARKS-V subset. Dark purple highlights low performance, bright yellow highlights high performance. Metrics are dependent on the benchmark.
Figure 14: Heatmap table of the raw scores for each model across 11 benchmarks of the MEDMARKS-OE subset. Dark purple highlights low performance, bright yellow highlights high performance. Metrics vary by benchmark; the reported metric is a normalized LLM Judge score.
Figure 15: Variance of model performance across all MEDMARKS-V benchmarks. Dark purple highlights low spread, bright yellow highlights high spread. Given we typically evaluate a model three times on each benchmark, we report the maximum score subtracted by the minimum score for a given benchmark. Note that since we only performed a single evaluation for HeadQA-v2 (Correa-Guillén et al., 2025), MedCalc-Bench (Khan…
Figure 16: Distribution of normalized model performance for each of the datasets in the MEDMARKS-V subset, across all 61 models across 71 variants tested.
Figure 18: Scatter plot of model scores on each of the MEDMARKS-V benchmarks, labeled by model size.
Figure 17: Distribution of normalized model performance for each of the datasets in both the MEDMARKS-V and MEDMARKS-OE subsets across the 12 models that were evaluated on both subsets.
Figure 19: Comparing the performance of Gemma 3 models to MedGemma 3 models on MEDMARKS-V tasks.
Figure 20: Scatter plot of weighted mean win rate on MEDMARKS-V by average tokens per response for each model. Each model is labeled by model size and whether it is a thinking model or standard model.
Figure 21: Bar plots comparing performance of instruct vs. reasoning models for Ministral 3, Olmo 3, and Qwen3 models.
Figure 22: Distribution of number of tokens generated for different models when the response is correct or incorrect.
Figure 23: Bar plots comparing performance of different reasoning levels for gpt-oss models on the MEDMARKS-V benchmarks.
Figure 24: Distribution of number of tokens generated for gpt-oss reasoning levels models when the response is correct or incorrect.
Figure 25: Bar plots comparing performance of quantization levels for Qwen3 models on the MEDMARKS-V benchmarks.
Figure 26: Comparing all model performance with and without an extra option on the Medbullets (Chen et al., 2025b) benchmark.
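
The Figure 15 caption describes its spread statistic as the maximum score minus the minimum score across repeated runs of a benchmark. As a rough illustration only, that computation might look like the sketch below; the function name `score_spread` and the example scores are hypothetical and do not come from the paper or its repository.

```python
from typing import Dict, List


def score_spread(runs: Dict[str, List[float]]) -> Dict[str, float]:
    """Per-benchmark spread: max score minus min score over repeated runs.

    Benchmarks with a single run get a spread of 0.0; benchmarks with no
    recorded runs are skipped.
    """
    return {bench: max(scores) - min(scores) for bench, scores in runs.items() if scores}


# Made-up scores for illustration (three runs, and one single-run benchmark).
example_runs = {
    "benchmark_a": [0.5, 0.75, 0.625],
    "benchmark_b": [0.25],
}
spreads = score_spread(example_runs)
```

A single-run benchmark always reports zero spread here, so a zero should not be read as evidence of stability for benchmarks that were only evaluated once.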
Original abstract

Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavily depend on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, & GPT-5.2) achieve the highest performance across both benchmarks, most frontier proprietary models are significantly more token efficient than open-weight alternatives, medically fine-tuned models outperform their generalist counterparts, and that models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks-T) can be directly used as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at https://github.com/MedARC-AI/Medmarks

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Medmarks, a fully open-source benchmark suite consisting of 30 tasks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. It reports a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge, with key findings that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, GPT-5.2) achieve the highest performance, proprietary frontier models are more token-efficient than open-weight alternatives, medically fine-tuned models outperform generalist counterparts, and models (especially smaller ones and Grok 4) exhibit answer-order bias. A subset of the benchmarks (Medmarks-T) is positioned for direct use as reinforcement learning environments.

Significance. If the central claims hold, Medmarks would constitute a useful, accessible addition to the medical LLM evaluation landscape by addressing benchmark saturation and data-access restrictions in prior suites. The open-source release, public code repository, and framing of a subset as RL environments are concrete strengths that could support reproducible research and post-training work. The scale of the model evaluation (61 models) also provides a broad snapshot of current capabilities.

major comments (2)
  1. [Abstract] Abstract: The primary results on model rankings for open-ended clinical reasoning tasks rest on LLM-as-a-Judge scores, yet no information is supplied on the specific judge model, prompt template, calibration procedure against expert physicians, or agreement statistics (e.g., Cohen's kappa or correlation with human ratings). In a safety-critical medical domain, this leaves open the possibility that scores reward stylistic features rather than clinical correctness, directly undermining the comparative claims.
  2. [Abstract] Abstract: The assertion that the 30 benchmarks provide comprehensive coverage of medical tasks is presented without supporting analysis of task selection criteria, overlap with existing saturated benchmarks, or gaps in clinical reasoning categories (e.g., differential diagnosis or longitudinal patient management). This choice is load-bearing for the claim of a 'comprehensive' suite.
minor comments (1)
  1. [Abstract] The abstract refers to 'verifiable metrics' for some tasks but does not enumerate which tasks use which metrics or provide example implementations; adding a brief table or reference to the code repository in the main text would improve clarity.
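
The agreement statistics requested in the first major comment can be made concrete with Cohen's kappa, which corrects raw agreement between two raters for agreement expected by chance. The sketch below is a plain-Python illustration, assuming binary or categorical labels from an LLM judge and a human rater on the same items; `cohens_kappa` and the example label lists are hypothetical and are not drawn from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treated as full agreement here.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for illustration: 1 = answer judged correct, 0 = judged incorrect.
judge_labels = [1, 1, 0, 1, 0, 0, 1, 0]
human_labels = [1, 0, 0, 1, 0, 1, 1, 0]
kappa = cohens_kappa(judge_labels, human_labels)  # observed 0.75, chance 0.5 -> 0.5
```

In this toy case the raters agree on 6 of 8 items (0.75), chance agreement is 0.5, and kappa is 0.5, which shows how a seemingly high raw agreement can correspond to only moderate chance-corrected agreement.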

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The primary results on model rankings for open-ended clinical reasoning tasks rest on LLM-as-a-Judge scores, yet no information is supplied on the specific judge model, prompt template, calibration procedure against expert physicians, or agreement statistics (e.g., Cohen's kappa or correlation with human ratings). In a safety-critical medical domain, this leaves open the possibility that scores reward stylistic features rather than clinical correctness, directly undermining the comparative claims.

    Authors: We agree that transparency regarding the LLM-as-a-Judge is critical in a medical context to ensure scores reflect clinical correctness rather than style. The manuscript references LLM-as-a-Judge for open-ended tasks but does not supply the specific details requested. We will revise the Methods section to explicitly describe the judge model, the prompt template, any calibration steps against expert physicians, and quantitative agreement statistics (such as Cohen's kappa or correlation with human ratings). This addition will directly address the concern and allow readers to evaluate the reliability of the comparative claims. revision: yes

  2. Referee: [Abstract] Abstract: The assertion that the 30 benchmarks provide comprehensive coverage of medical tasks is presented without supporting analysis of task selection criteria, overlap with existing saturated benchmarks, or gaps in clinical reasoning categories (e.g., differential diagnosis or longitudinal patient management). This choice is load-bearing for the claim of a 'comprehensive' suite.

    Authors: We acknowledge that the 'comprehensive' framing requires explicit justification to be fully convincing. While the manuscript describes the 30 tasks as spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning, it does not include a dedicated analysis of selection criteria, overlaps with prior benchmarks, or remaining gaps. In the revised manuscript, we will add a subsection (likely in the Introduction or a new 'Benchmark Design' section) that details the task selection process, compares coverage against saturated benchmarks (e.g., MedQA, PubMedQA, MMLU medical subsets), and explicitly discusses gaps such as differential diagnosis and longitudinal patient management. This will provide a clearer rationale for the suite's scope. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark suite release with no derivations or self-referential predictions

The paper presents Medmarks as an open-source collection of 30 benchmarks and reports empirical evaluations of 61 models using verifiable metrics and LLM-as-a-Judge. No equations, fitted parameters, uniqueness theorems, or derivation chains are claimed or present in the provided text. All results follow directly from running external models on the released benchmarks rather than reducing to internal definitions or self-citations. This is a standard data-release and evaluation paper whose central claims rest on observable model outputs, not on any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the contribution is an empirical benchmark collection and evaluation rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5667 in / 1194 out tokens · 25168 ms · 2026-05-09T14:30:22.526259+00:00 · methodology

