MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
4 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (4)
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
Representative citing papers
A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.
A multi-stage framework with prompt calibration, rule-based filtering, semantic checks, judge LLM review, and predictive validation enables trustworthy LLM extraction of substance use disorder diagnoses from nearly 920,000 clinical notes, achieving F1 of 0.80 and superior care-engagement prediction.
MedFabric dataset and EtHER detector achieve over 15% better word-level fabrication detection in medical LLMs than prior methods by generating stylistically faithful errors and using decomposition-based checking.
citing papers explorer
- Graph Alignment Topology as an Inductive Bias for Grounding Detection
  A GNN trained on bipartite alignment graphs between references and LLM generations reports state-of-the-art hallucination detection across four datasets, beating prior methods and GPT-4o.
- Hallucination Detection via Activations of Open-Weight Proxy Analyzers
  A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.
- A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
  A multi-stage framework with prompt calibration, rule-based filtering, semantic checks, judge LLM review, and predictive validation enables trustworthy LLM extraction of substance use disorder diagnoses from nearly 920,000 clinical notes, achieving F1 of 0.80 and superior care-engagement prediction.
- MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs
  MedFabric dataset and EtHER detector achieve over 15% better word-level fabrication detection in medical LLMs than prior methods by generating stylistically faithful errors and using decomposition-based checking.