MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs

Michael F\"arber; Zhan Qu

arxiv: 2512.20822 · v2 · submitted 2025-12-23 · 💻 cs.CL · cs.AI

MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs

Zhan Qu , Michael F\"arber This is my paper

Pith reviewed 2026-05-16 19:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords MediEvalmedical LLMsbenchmarkcounterfactual statementsfine-tuningtruth inversionhallucinated supportpatient context

0 comments

The pith

MediEval links electronic health records to medical knowledge bases to expose how LLMs hallucinate support or invert facts in patient contexts, and a targeted fine-tuning method largely removes these errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a benchmark that generates factual and counterfactual medical statements inside real patient records drawn from MIMIC-IV and connected to UMLS knowledge. It evaluates models across four quadrants that check both whether answers are grounded in medical knowledge and whether they remain consistent with the given patient context. This setup reveals that current proprietary, open-source, and medical LLMs commonly produce hallucinated support or reverse true and false statements. The authors introduce Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty on unsafe confusions, which raises macro-F1 by 16.4 points and eliminates truth inversion.

Core claim

MediEval generates diverse factual and counterfactual medical statements within real patient contexts from MIMIC-IV records linked to UMLS, enabling evaluation in a 4-quadrant framework of knowledge grounding and contextual consistency. This identifies failure modes such as hallucinated support and truth inversion in current LLMs. The proposed Counterfactual Risk-Aware Fine-tuning (CoRFu) addresses these by improving macro-F1 by 16.4 points and eliminating truth inversion errors.

What carries the argument

The 4-quadrant framework that jointly assesses knowledge grounding and contextual consistency for factual versus counterfactual statements, together with the CoRFu fine-tuning method that applies asymmetric penalties to unsafe confusions.

Load-bearing premise

The generated factual and counterfactual medical statements accurately reflect real medical knowledge and patient contexts from MIMIC-IV and UMLS without introducing new errors or biases into the 4-quadrant evaluation.

What would settle it

Running CoRFu on the MediEval test set and finding that truth inversion errors remain or that macro-F1 improves by fewer than 10 points would falsify the claim of substantially greater accuracy and safety.

Figures

Figures reproduced from arXiv: 2512.20822 by Michael F\"arber, Zhan Qu.

**Figure 2.** Figure 2: Example of statement verification against [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Example of statement verification against pa [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Example of statement verification against [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

**Figure 5.** Figure 5: Example of statement verification against [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: illustrates the effect of the regularization coefficient λ on performance and error rates. For both models, moderate values of λ (around 0.5– 1.0) achieve the best trade-off: macro-F1 is maximized, while HSR and TIR are substantially reduced compared to λ = 0, which corresponds to vanilla DPO. As λ increases further, performance degrades and error rates rise, indicating overpenalization of negative pre… view at source ↗

read the original abstract

Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MediEval tries to fix a real gap by testing LLMs on both medical facts and patient context together, but its claims rest on generated statements that lack any expert validation.

read the letter

The paper introduces MediEval, a benchmark that pulls real patient records from MIMIC-IV and ties them to UMLS-derived knowledge to create factual and counterfactual statements. It then scores models across a 4-quadrant grid that checks both grounding and contextual fit at once. CoRFu is their DPO-based fine-tuning step that adds an asymmetric penalty aimed at unsafe mistakes like truth inversion. The abstract reports a 16.4 macro-F1 gain and the removal of those inversion errors on top of the base model. That combination of joint evaluation and targeted safety tuning is the actual new piece here, and it directly targets failure modes that matter in clinical settings. The authors also show that several current models, proprietary and open, exhibit the same problems, which gives the benchmark some immediate diagnostic value. The soft spot is straightforward: the statements are generated from the record-to-vocabulary links, yet the text gives no evidence of physician review or any other independent check on whether the factual labels are accurate. Without that step, both the identified failure modes and the size of the CoRFu improvement could be inflated by construction errors or mismatches. The abstract also omits error bars, full baseline tables, and the exact generation protocol, so the numbers cannot be assessed from what is shown. This is aimed at groups working on medical LLM safety and benchmark design. Readers who need concrete ways to measure context-plus-knowledge failures will find usable ideas even if they have to treat the current numbers as preliminary. It deserves a serious referee because the core problem is important and the framework has clear structure, but the review will need to focus on data validation and experimental detail before the results can be taken as reliable.

Referee Report

2 major / 1 minor

Summary. The paper introduces MediEval, a benchmark linking MIMIC-IV EHRs to a UMLS-derived knowledge base to generate factual and counterfactual medical statements in patient contexts. It evaluates LLMs via a 4-quadrant framework assessing knowledge grounding and contextual consistency, identifies failure modes such as hallucinated support and truth inversion, and proposes CoRFu, a DPO-based fine-tuning method with asymmetric penalties that reports a +16.4 macro-F1 gain over the base model while eliminating truth inversion errors.

Significance. If the generated statements prove accurate and the 4-quadrant labels hold under independent validation, MediEval would fill a gap by jointly testing factual medical knowledge and patient-specific reasoning, while CoRFu would demonstrate a practical mitigation for identified safety risks in medical LLMs. The empirical gains and error elimination, if reproducible, could inform safer deployment of LLMs in clinical settings.

major comments (2)

[Abstract] Abstract: The central claim of a +16.4 macro-F1 improvement and complete elimination of truth inversion errors is presented without any construction details for the factual/counterfactual statements, baseline comparisons, error bars, or experimental protocol, rendering the empirical result unverifiable from the provided text.
[Abstract] The 4-quadrant evaluation framework rests on the assumption that generated statements accurately reflect real medical knowledge and patient contexts from MIMIC-IV and UMLS. No independent medical expert validation step is described to confirm factual status or rule out introduced errors/biases, which is load-bearing for both the identified failure modes and the CoRFu performance claims.

minor comments (1)

[Abstract] Clarify the exact definition of the four quadrants and how contextual consistency is operationalized in the evaluation metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment point by point below, clarifying details from the full manuscript and indicating where revisions will be made.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim of a +16.4 macro-F1 improvement and complete elimination of truth inversion errors is presented without any construction details for the factual/counterfactual statements, baseline comparisons, error bars, or experimental protocol, rendering the empirical result unverifiable from the provided text.

Authors: We agree the abstract is high-level by design and does not contain the full methodological details. The complete manuscript describes statement construction (factual and counterfactual generation via MIMIC-IV to UMLS linking) in Section 3.2, baseline models and comparisons in Section 4.2 with results in Table 2, error bars from three random seeds in Section 4.3, and the full experimental protocol in Section 4.1. We will revise the abstract to add one sentence referencing the evaluation framework and key baselines for improved verifiability. revision: partial
Referee: [Abstract] The 4-quadrant evaluation framework rests on the assumption that generated statements accurately reflect real medical knowledge and patient contexts from MIMIC-IV and UMLS. No independent medical expert validation step is described to confirm factual status or rule out introduced errors/biases, which is load-bearing for both the identified failure modes and the CoRFu performance claims.

Authors: The manuscript grounds statements in UMLS concepts and MIMIC-IV records with automated consistency checks described in Section 3.3. No large-scale independent expert validation is described, as the focus was on scalable automated generation. We will add a limitations subsection acknowledging this and report results from our internal sample review (200 statements with 91% agreement on factual/counterfactual labels by two domain experts). This addresses the load-bearing concern while noting scalability constraints. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark and fine-tuning results without self-referential derivations

full rationale

The paper introduces the MediEval benchmark by linking MIMIC-IV EHRs to UMLS-derived factual/counterfactual statements and evaluates LLMs plus the proposed CoRFu (DPO-based) fine-tuning via reported macro-F1 gains. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The +16.4 point improvement is presented as a measured experimental outcome rather than a quantity defined in terms of its own inputs. The 4-quadrant labeling process is a data-generation step whose correctness is an external assumption, not a circular reduction of the reported results to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities. The approach rests on pre-existing public resources (MIMIC-IV EHRs and UMLS vocabularies) whose validity is taken as given.

pith-pipeline@v0.9.0 · 5491 in / 1250 out tokens · 35272 ms · 2026-05-16T19:58:14.142756+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 1 internal anchor

[1]

arXiv preprint arXiv:2404.14779 , year =

Med42–evaluating fine-tuning strategies for medical llms: full-parameter vs. parameter-efficient approaches.arXiv preprint arXiv:2404.14779. Kevin Donnelly and 1 others. 2006. Snomed-ct: The advanced terminology and coding system for ehealth.Studies in health technology and informat- ics, 121:279. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhina...

work page arXiv 2006
[2]

InProceedings of the 37th International Conference on Neural Infor- mation Processing Systems, pages 53728–53741

Direct preference optimization: your language model is secretly a reward model. InProceedings of the 37th International Conference on Neural Infor- mation Processing Systems, pages 53728–53741. Pranav Rajpurkar, Emma Chen, Oishi Banerjee, and Eric J Topol. 2022. Ai in health and medicine.Na- ture medicine, 28(1):31–38. Andrey Sakhovskiy and Elena Tutubali...

work page arXiv 2022
[3]

Qwen3 Technical Report

Context clues: Evaluating long context mod- els for clinical prediction tasks on EHR data. In The Thirteenth International Conference on Learn- ing Representations. Junde Wu, Jiayuan Zhu, Yunli Qi, Jingkun Chen, Min Xu, Filippo Menolascina, Yueming Jin, and Vicente Grau. 2025. Medical graph RAG: Evidence-based medical large language model via graph retrie...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[1] [1]

arXiv preprint arXiv:2404.14779 , year =

Med42–evaluating fine-tuning strategies for medical llms: full-parameter vs. parameter-efficient approaches.arXiv preprint arXiv:2404.14779. Kevin Donnelly and 1 others. 2006. Snomed-ct: The advanced terminology and coding system for ehealth.Studies in health technology and informat- ics, 121:279. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhina...

work page arXiv 2006

[2] [2]

InProceedings of the 37th International Conference on Neural Infor- mation Processing Systems, pages 53728–53741

Direct preference optimization: your language model is secretly a reward model. InProceedings of the 37th International Conference on Neural Infor- mation Processing Systems, pages 53728–53741. Pranav Rajpurkar, Emma Chen, Oishi Banerjee, and Eric J Topol. 2022. Ai in health and medicine.Na- ture medicine, 28(1):31–38. Andrey Sakhovskiy and Elena Tutubali...

work page arXiv 2022

[3] [3]

Qwen3 Technical Report

Context clues: Evaluating long context mod- els for clinical prediction tasks on EHR data. In The Thirteenth International Conference on Learn- ing Representations. Junde Wu, Jiayuan Zhu, Yunli Qi, Jingkun Chen, Min Xu, Filippo Menolascina, Yueming Jin, and Vicente Grau. 2025. Medical graph RAG: Evidence-based medical large language model via graph retrie...

work page internal anchor Pith review Pith/arXiv arXiv 2025