MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models

Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding · 2025 · arXiv 2502.14302

4 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Graph Alignment Topology as an Inductive Bias for Grounding Detection

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

A GNN trained on bipartite alignment graphs between references and LLM generations reports state-of-the-art hallucination detection across four datasets, beating prior methods and GPT-4o.

Hallucination Detection via Activations of Open-Weight Proxy Analyzers

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.

A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

cs.CL · 2026-04-07 · unverdicted · novelty 6.0

A multi-stage framework with prompt calibration, rule-based filtering, semantic checks, judge LLM review, and predictive validation enables trustworthy LLM extraction of substance use disorder diagnoses from nearly 920,000 clinical notes, achieving an F1 of 0.80 and superior care-engagement prediction.

MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs

cs.CL · 2026-05-05 · unverdicted · novelty 5.0

The MedFabric dataset and EtHER detector achieve over 15% better word-level fabrication detection in medical LLMs than prior methods by generating stylistically faithful errors and using decomposition-based checking.

citing papers explorer


Graph Alignment Topology as an Inductive Bias for Grounding Detection · cs.CL · 2026-05-21 · unverdicted · none · ref 28
Hallucination Detection via Activations of Open-Weight Proxy Analyzers · cs.CL · 2026-05-08 · unverdicted · none · ref 17
A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models · cs.CL · 2026-04-07 · unverdicted · none · ref 27
MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs · cs.CL · 2026-05-05 · unverdicted · none · ref 28
