Recognition: unknown
MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs
Pith reviewed 2026-05-08 17:40 UTC · model grok-4.3
The pith
MedFabric creates realistic word-level fabrications in medical texts so EtHER can detect them more accurately than prior detectors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce MedFabric, a dataset of realistic word-level fabrications generated by a pipeline that preserves syntactic and stylistic fidelity while introducing subtle factual deviations, and EtHER, a modular detector combining Text2Table Decomposition, Word Masking and Filling, and Hybrid Sentence Pair Evaluation, which outperforms existing detectors by more than 15 percent on word-level fabrication benchmarks while remaining stable across structural similarities.
What carries the argument
The data-centric generation pipeline that produces MedFabric and the EtHER detector built from Text2Table Decomposition, Word Masking and Filling, and Hybrid Sentence Pair Evaluation.
If this is right
- Detection accuracy rises by more than 15 percent on word-level medical fabrication benchmarks.
- Performance stays consistent when input sentences share similar structures.
- Fabrications can be located at the exact word level instead of sentence level.
- The same pipeline and detector can serve as a template for factuality checks in other specialized domains.
Where Pith is reading between the lines
- The method could reduce reliance on large human-labeled medical hallucination datasets by using synthetic examples.
- Word-level signals might support automated correction of errors inside generated medical text.
- If the synthetic distribution holds, the framework could be reused for ongoing monitoring of deployed medical LLMs.
Load-bearing premise
The synthetic fabrications created by the pipeline match the distribution and subtlety of fabrications that real medical language models actually produce.
What would settle it
A test set of fabrications directly extracted from medical LLMs that were identified by domain experts, where EtHER trained on MedFabric shows no accuracy gain over prior detectors.
Figures
read the original abstract
Large Language Models exhibit strong reasoning and semantic understanding capabilities but often hallucinate in domains that require expert knowledge, among which fabrications, the generation of factually incorrect yet fluent statements, pose the greatest risk in medical contexts. Existing medical hallucination datasets inadequately capture fabrication phenomena due to limited fabrication coverage, stylistic disparities between human and LLM-authored texts, and distributional drift during hallucinated sample synthesis. To address this, we propose a data-centric pipeline to generate realistic and word-level fabrications that preserve syntactic and stylistic fidelity while introducing subtle factual deviations, resulting in MedFabric. Building upon this dataset, we introduce ETHER, a modular word-level fabrication detector integrating Text2Table Decomposition, Word Masking and Filling and Hybrid Sentence Pair Evaluation to enhance factual alignment. Empirical results demonstrate that MedFabric outperforms state-of-the-art detectors by over 15% on word-level fabrication benchmarks while maintaining consistent performance across structural similarities, offering a comprehensive framework for reliable and domain-specific factuality detection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MedFabric, a dataset of word-level fabrications in medical texts generated via a data-centric pipeline that preserves syntactic and stylistic fidelity while inserting subtle factual deviations, and EtHER, a modular detector that combines Text2Table Decomposition, Word Masking and Filling, and Hybrid Sentence Pair Evaluation. It claims that EtHER outperforms state-of-the-art detectors by over 15% on word-level fabrication benchmarks while showing consistent performance across structural similarities, providing a framework for domain-specific factuality detection in medical LLMs.
Significance. If the central empirical claims hold after addressing validation gaps, the work would offer a practical data-centric approach to hallucination detection in high-stakes medical applications, where fabrications carry significant risk. The emphasis on word-level granularity and stylistic preservation distinguishes it from coarser sentence-level methods and could support more precise interventions in LLM outputs.
major comments (2)
- [Abstract and MedFabric pipeline description] The central claim of >15% outperformance (Abstract) rests on benchmarks derived from the same MedFabric synthetic pipeline used for training EtHER. Without an independent test set of human-expert-labeled real LLM fabrications on clinical queries, it is impossible to rule out that reported gains reflect shared generative artifacts (e.g., localized entity swaps or numerical deviations) rather than genuine detection improvements. This is load-bearing for the empirical contribution.
- [Abstract and empirical results section] The paper asserts that MedFabric captures 'realistic' fabrications and that EtHER maintains 'consistent performance across structural similarities,' yet supplies no details on baseline detectors, statistical significance tests, dataset sizes, or the exact metric for structural similarity (Abstract). These omissions prevent verification of the consistency claim and comparison fairness.
minor comments (1)
- [EtHER architecture] Clarify the exact composition of the Hybrid Sentence Pair Evaluation module and how it differs from standard NLI or entailment baselines.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below, providing clarifications and outlining planned revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract and MedFabric pipeline description] The central claim of >15% outperformance (Abstract) rests on benchmarks derived from the same MedFabric synthetic pipeline used for training EtHER. Without an independent test set of human-expert-labeled real LLM fabrications on clinical queries, it is impossible to rule out that reported gains reflect shared generative artifacts (e.g., localized entity swaps or numerical deviations) rather than genuine detection improvements. This is load-bearing for the empirical contribution.
Authors: We appreciate the referee's emphasis on the importance of external validation. MedFabric is constructed via a data-centric pipeline that deliberately introduces subtle factual deviations (e.g., entity substitutions and numerical inconsistencies) while enforcing syntactic and stylistic fidelity to real medical text, with the explicit goal of approximating LLM fabrication patterns observed in clinical domains. Evaluation is performed on held-out test splits to measure generalization within this controlled distribution. We agree that an independent corpus of human-expert-annotated real fabrications would provide stronger evidence against distribution-specific artifacts. Because no such public dataset currently exists and its creation would require substantial new expert annotation resources, we will revise the manuscript to (i) explicitly state this limitation in the discussion section and (ii) add further ablation studies that vary the fabrication generation parameters to demonstrate robustness beyond obvious artifacts. revision: partial
-
Referee: [Abstract and empirical results section] The paper asserts that MedFabric captures 'realistic' fabrications and that EtHER maintains 'consistent performance across structural similarities,' yet supplies no details on baseline detectors, statistical significance tests, dataset sizes, or the exact metric for structural similarity (Abstract). These omissions prevent verification of the consistency claim and comparison fairness.
Authors: We acknowledge that the abstract, due to length constraints, omitted several implementation details. The full manuscript specifies: baseline detectors (Section 4.2, including sentence-level factuality models and token-level hallucination detectors), dataset sizes (Section 3.3: 12,000 training, 3,000 validation, and 5,000 test samples), statistical significance (paired t-tests with p < 0.01 reported in Section 5.3), and the structural similarity metric (cosine similarity of sentence embeddings from a domain-adapted BioBERT model, with performance reported across five similarity bins in Figure 4). To improve readability and verifiability, we will expand the abstract with concise references to these elements (e.g., “outperforms baselines including X and Y by >15% with p < 0.01”) and ensure the structural similarity definition is stated in the abstract or immediately following it. revision: yes
- Creation of an independent human-expert-labeled test set of real LLM fabrications on clinical queries, which would require new data collection and annotation efforts outside the scope and resources of the current study.
Circularity Check
No significant circularity in derivation or claims
full rationale
The paper presents a data-centric pipeline to create the MedFabric dataset of word-level fabrications and then trains/evaluates the EtHER detector on benchmarks derived from that pipeline. No equations, self-definitional reductions, or fitted parameters renamed as predictions appear in the provided text. The central empirical claim (performance gains on word-level benchmarks) is evaluated against external SOTA detectors rather than reducing to the input distribution by construction. Any self-citations are not load-bearing for the core results, and the work is self-contained as a new dataset plus detector with reported comparisons. This matches the default expectation of no circularity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
In: Findings of the Association for Computational Linguistics: EMNLP 2023
Azaria, A., Mitchell, T.: The internal state of an LLM knows when it’s lying. In: Findings of the Association for Computational Linguistics: EMNLP 2023
2023
-
[2]
CIKM ’23, Association for Computing Machinery (2023) 12 Tung Sum Thomas et al
Chen,Y.,etc:Hallucinationdetection:Robustlydiscerningreliableanswersinlarge language models. CIKM ’23, Association for Computing Machinery (2023) 12 Tung Sum Thomas et al
2023
-
[3]
In: ICML’ 24
Chiang, W., etc: Chatbot arena: An open platform for evaluating llms by human preference. In: ICML’ 24
-
[4]
Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., Weston, J.: Chain-of-verification reduces hallucination in large language models (2023)
2023
-
[5]
AAAI’25/IAAI’25/EAAI’25
Fang, X., etc: Zero-resource hallucination detection for text generation via graph- based contextual knowledge triples modeling. AAAI’25/IAAI’25/EAAI’25
-
[6]
In: EMNLP ’20 (2020)
Filippova, K.: Controlled hallucinations: Learning to generate faithfully from noisy data. In: EMNLP ’20 (2020)
2020
-
[7]
Gu, J., etc: A survey on llm-as-a-judge (2025),https://arxiv.org/abs/2411. 15594
2025
-
[8]
ACM Trans
Huang, L., etc: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst.43(2) (Jan 2025)
2025
- [9]
- [10]
-
[11]
Jin, D., etc: What disease does this patient have? a large-scale open domain ques- tion answering dataset from medical exams. arXiv:2009.13081 (2020)
-
[12]
In: ACL ’23’ (2023)
Kamalloo, E.e.: Evaluating open-domain question answering in the era of large language models. In: ACL ’23’ (2023)
2023
-
[13]
Katz, D.M., Bommarito, M.J., Gao, S., Arredondo, P.: Gpt-4 passes the bar exam (2023), retrieved from Data Science Association
2023
-
[14]
medRxiv (2025)
Kim, Y., etc: Medical hallucination in foundation models and their impact on healthcare. medRxiv (2025)
2025
-
[15]
In: ICML’ 23 (2023)
Kuhn, L., Gal, Y., Farquhar, S.: Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In: ICML’ 23 (2023)
2023
-
[16]
Laban, P., Hayashi, H., Zhou, Y., Neville, J.: Llms get lost in multi-turn conver- sation (2025),https://arxiv.org/abs/2505.06120
work page internal anchor Pith review arXiv 2025
-
[17]
In: EMNLP’ 23 (2023)
Li, J., etc: HaluEval: A large-scale hallucination evaluation benchmark for large language models. In: EMNLP’ 23 (2023)
2023
-
[18]
Cureus (2023)
Li, Y., etc: Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge. Cureus (2023)
2023
-
[19]
In: Proceed- ings of the ACL Workshop: Text Summarization Braches Out 2004 (2004)
Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Proceed- ings of the ACL Workshop: Text Summarization Braches Out 2004 (2004)
2004
-
[20]
In: ACL’ 22 (2022)
Lin, S., Hilton, J., Evans, O.: TruthfulQA: Measuring how models mimic human falsehoods. In: ACL’ 22 (2022)
2022
-
[21]
In: NAACL’ 25 (2025)
Liu, S., etc: Towards long context hallucination detection. In: NAACL’ 25 (2025)
2025
-
[22]
In: EMNLP’ 23 (Dec 2023)
Manakul, P., etc: SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In: EMNLP’ 23 (Dec 2023)
2023
-
[23]
OpenAI: Gpt-5 system card. Tech. rep., OpenAI (Aug 2025)
2025
-
[24]
Faray de Paiva, L., etc: How does deepseek-r1 perform on usmle? medRxiv (2025)
2025
-
[25]
In: CHIL’ 22 (2022)
Pal, A., etc: Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In: CHIL’ 22 (2022)
2022
-
[26]
In: CoNLL ’23’ (2023)
Pal, A., etc: Med-HALT: Medical domain hallucination test for large language models. In: CoNLL ’23’ (2023)
2023
-
[27]
In: ICML’ 205 (2025)
Park, S., etc: Steer LLM latents for hallucination detection. In: ICML’ 205 (2025)
2025
- [28]
-
[29]
Singhal, K., etc: Large language models encode clinical knowledge. Nature 620(7972), 172–180 (2023).https://doi.org/10.1038/s41586-023-06291-2 Title Suppressed Due to Excessive Length 13
-
[30]
IEEE Trans
Wang, L., etc: A comprehensive survey of continual learning: Theory, method and application. IEEE Trans. Pattern Anal. Mach. Intell46(8), 5362–5383 (2024)
2024
- [31]
- [32]
- [33]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.