A multilingual hallucination benchmark: MultiWikiQHalluA
Pith reviewed 2026-05-09 16:05 UTC · model grok-4.3
The pith
Smaller LLMs produce more hallucinations than larger ones, especially in lower-resource languages like Icelandic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using classifiers trained on synthetic faithfulness-hallucination data derived from the MultiWikiQA dataset, the authors find that up to 60 percent of Qwen3-0.6B's answers contain at least one hallucination, with the peak in Icelandic. Larger models such as cogito-v1-preview-qwen-32B and cogito-v1-preview-llama-70B achieve lower rates in most languages, and rates are consistently higher for lower-resource languages.
What carries the argument
Token-level hallucination classifiers for 30 European languages, trained on synthetic faithfulness-hallucination data that the LettuceDetect framework generates from the MultiWikiQA dataset (synthetic data is produced for 306 languages).
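The answer-level rates quoted in the core claim follow from token-level predictions only once an aggregation rule is fixed. The sketch below assumes an answer counts as hallucinated when at least one of its tokens is flagged, which matches the "at least one hallucination" wording; the function names and the 0/1 label encoding are illustrative assumptions and are not taken from the paper or from LettuceDetect.

```python
from typing import Sequence


def answer_has_hallucination(token_labels: Sequence[int]) -> bool:
    """Return True if any token is flagged (assumed encoding: 1 = hallucinated)."""
    return any(label == 1 for label in token_labels)


def hallucination_rate(answers: Sequence[Sequence[int]]) -> float:
    """Fraction of answers that contain at least one flagged token."""
    if not answers:
        raise ValueError("need at least one answer")
    flagged = sum(answer_has_hallucination(a) for a in answers)
    return flagged / len(answers)


# Hypothetical token-level predictions for five answers.
example = [
    [0, 0, 0],
    [0, 1, 1, 0],
    [0, 0],
    [1],
    [0, 0, 0, 0],
]
print(hallucination_rate(example))  # 2 of 5 answers flagged -> 0.4
```

Under this rule a single false-positive token is enough to flag a whole answer, so the answer-level rate is sensitive to token-level classifier precision.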
If this is right
- Larger models exhibit lower hallucination rates than smaller ones across the tested languages.
- Hallucination rates increase for lower-resource languages compared with high-resource ones.
- The benchmark supports evaluation of hallucination behavior in 30 European languages beyond English.
- Models such as cogito-v1-preview-qwen-32B and cogito-v1-preview-llama-70B show the lowest hallucination rates on most languages tested.
Where Pith is reading between the lines
- Model training should incorporate more data from low-resource languages to address elevated hallucination rates.
- Synthetic data methods can extend hallucination detection to languages where direct human labels are limited.
- The size-dependent pattern suggests that scaling alone may leave gaps in multilingual reliability.
Load-bearing premise
The synthetic hallucinations created from MultiWikiQA data accurately reflect the distribution and types of faithfulness errors that real models make in non-English languages.
What would settle it
Human annotation of model-generated answers in Icelandic to check whether the classifier-identified hallucination rates and locations match actual factual divergences or internal inconsistencies.
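One way to quantify such a comparison is chance-corrected agreement between human and classifier judgments. The sketch below computes Cohen's kappa over answer-level binary flags; the choice of kappa, the 0/1 encoding, and the example data are assumptions for illustration, since the source reports no human annotation.

```python
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two binary (0/1) label sequences of equal length."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters give the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one constant label")
    return (observed - expected) / (1 - expected)


# Hypothetical answer-level flags (1 = answer contains a hallucination).
classifier_flags = [1, 1, 0, 0, 1, 0, 1, 0]
human_flags = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(classifier_flags, human_flags))  # 0.5
```

Agreement on hallucination locations, as opposed to answer-level flags, would need a token- or span-level variant of the same comparison.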
Original abstract
Most hallucination evaluations focus on English, leaving it unclear whether findings transfer to lower-resource languages. We investigate faithfulness hallucinations, defined as model-generated content that is fluent and plausible but diverges from the provided input or is internally inconsistent. Leveraging the multilingual MultiWikiQA dataset, we utilize the LettuceDetect framework to create synthetic hallucination datasets for 306 languages, from which we train token-level hallucination classifiers for 30 European languages. In this work, we present evaluations of model hallucinations on a selection of languages: English, Danish, German, and Icelandic. Using these classifiers, we evaluate the hallucination rates for Qwen3-0.6B, Qwen3-14B, Gemma-3-12B-IT, cogito-v1-preview-qwen-32B, and cogito-v1-preview-llama-70B. Our classifiers reveal notably higher hallucination rates for Qwen3-0.6B (up to 60% of answers containing at least one hallucination, peaking in Icelandic) and generally lower rates for larger models, with cogito-v1-preview-qwen-32B and cogito-v1-preview-llama-70B performing best on most languages. Hallucination rates are consistently higher for lower-resource languages, particularly Icelandic.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MultiWikiQHalluA, a multilingual benchmark for faithfulness hallucinations. Using the MultiWikiQA dataset and LettuceDetect framework, it generates synthetic hallucination data for 306 languages, trains token-level classifiers for 30 European languages, and evaluates hallucination rates for five LLMs (Qwen3-0.6B, Qwen3-14B, Gemma-3-12B-IT, cogito-v1-preview-qwen-32B, cogito-v1-preview-llama-70B) on English, Danish, German, and Icelandic. Key claims include up to 60% hallucination rates for Qwen3-0.6B (peaking in Icelandic), lower rates for larger models, and consistently higher rates in lower-resource languages.
Significance. If the synthetic data and classifiers reliably capture real hallucinations, the work offers a scalable method to extend hallucination evaluation beyond English, potentially aiding development of more robust multilingual models and highlighting language-specific reliability gaps. The scale (306 languages for data generation) is a notable strength, though the central measurements depend on unvalidated assumptions about synthetic data fidelity.
major comments (3)
- [Methods] Methods (synthetic data generation): The classifiers are trained exclusively on LettuceDetect synthetic hallucinations, yet no quantitative validation (human annotations, inter-annotator agreement, or direct comparison to real model outputs in Danish/German/Icelandic) is reported to confirm that these examples match the distribution and linguistic cues of genuine faithfulness errors; this is load-bearing for all reported rates.
- [Results] Results (hallucination rate measurements): The headline percentages (e.g., 60% for Qwen3-0.6B) are presented without error bars, confidence intervals, or statistical significance tests, and no details are given on classifier evaluation (train/test splits, F1/precision/recall on held-out synthetic or real data), leaving the quantitative claims without visible empirical support.
- [Evaluation] Evaluation setup: The paper evaluates only four languages despite generating data for 306 and classifiers for 30; no justification or ablation is provided for this selection, nor any cross-lingual transfer analysis to support generalization to Icelandic (the lowest-resource language highlighted).
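The classifier metrics requested in the Results comment could be computed from held-out token labels roughly as follows. This is a minimal sketch: the label encoding (0 = supported, 1 = unsupported), the function names, and the example labels are assumptions for illustration and do not reproduce the paper's evaluation code.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[int], pred: Sequence[int], positive: int
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class of a token-level labelling."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Share of tokens whose predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical held-out token labels (assumed: 0 = supported, 1 = unsupported).
gold = [0, 0, 1, 1, 1, 0, 0, 1]
pred = [0, 1, 1, 1, 0, 0, 0, 1]
print("unsupported:", precision_recall_f1(gold, pred, positive=1))
print("supported:  ", precision_recall_f1(gold, pred, positive=0))
print("accuracy:   ", accuracy(gold, pred))
```

Reporting F1 separately for each class matters here because the unsupported class is the one that drives the headline hallucination rates.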
minor comments (2)
- [Abstract] Abstract: The claim of 'notably higher hallucination rates' for Qwen3-0.6B would benefit from explicit comparison numbers for the other models rather than qualitative descriptors.
- [Introduction] Notation: 'Faithfulness hallucinations' is defined but the precise operationalization (e.g., how internal inconsistency vs. input divergence is labeled in synthetic data) could be clarified with an example.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, agreeing where revisions are needed to improve clarity and rigor, and explaining our approach where we maintain the original design. We will incorporate changes in the revised version.
-
Referee: [Methods] Methods (synthetic data generation): The classifiers are trained exclusively on LettuceDetect synthetic hallucinations, yet no quantitative validation (human annotations, inter-annotator agreement, or direct comparison to real model outputs in Danish/German/Icelandic) is reported to confirm that these examples match the distribution and linguistic cues of genuine faithfulness errors; this is load-bearing for all reported rates.
Authors: We acknowledge that explicit validation of the synthetic data against real model outputs for non-English languages would strengthen the work. The LettuceDetect framework generates synthetic faithfulness hallucinations via targeted perturbations of reference answers, following established practices in hallucination detection research. No human annotations or inter-annotator agreement studies were performed for Danish, German, or Icelandic in this study. In the revised manuscript we will add a dedicated subsection describing the synthetic generation process in detail, reference any validation results from the original LettuceDetect paper (primarily English), and include an explicit limitations paragraph stating the assumption that synthetic examples approximate real hallucinations. This makes the methodological reliance transparent without claiming unperformed validation.
revision: partial
-
Referee: [Results] Results (hallucination rate measurements): The headline percentages (e.g., 60% for Qwen3-0.6B) are presented without error bars, confidence intervals, or statistical significance tests, and no details are given on classifier evaluation (train/test splits, F1/precision/recall on held-out synthetic or real data), leaving the quantitative claims without visible empirical support.
Authors: We agree that the results would be more robust with additional statistical details. The classifiers were trained using standard 80/20 train/test splits on the synthetic data for each language. In the revision we will report per-language F1, precision, and recall on held-out synthetic test sets. For the reported hallucination rates we will add bootstrap-derived confidence intervals and conduct statistical significance tests (e.g., McNemar tests) comparing rates across models and languages. These additions will provide the requested empirical support for the headline figures.
revision: yes
-
Referee: [Evaluation] Evaluation setup: The paper evaluates only four languages despite generating data for 306 and classifiers for 30; no justification or ablation is provided for this selection, nor any cross-lingual transfer analysis to support generalization to Icelandic (the lowest-resource language highlighted).
Authors: The four languages were selected to span a resource spectrum (English and German as high-resource, Danish as medium, Icelandic as lower-resource) among the 30 European languages for which classifiers were trained. We will add a justification subsection explaining this choice based on data availability and the desire to highlight lower-resource behavior. We will also include an ablation showing classifier performance trends across resource levels for all 30 languages and a brief discussion of implications for cross-lingual generalization to Icelandic. Full evaluation on all 306 languages is outside the current scope, as classifiers were trained only for the 30 languages with sufficient MultiWikiQA coverage.
revision: partial
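The bootstrap confidence intervals and McNemar tests promised in the second response could look roughly like the sketch below. It assumes answer-level 0/1 flags, a percentile bootstrap, and an exact two-sided McNemar test on paired flags (two models answering the same questions); all names, defaults, and data are hypothetical and are not taken from the paper.

```python
import random
from math import comb
from typing import Sequence, Tuple


def bootstrap_rate_ci(
    flags: Sequence[int], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of 0/1 flags."""
    if not flags:
        raise ValueError("need at least one flag")
    rng = random.Random(seed)
    n = len(flags)
    # Resample answers with replacement and recompute the rate each time.
    rates = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples))
    lower = rates[int(alpha / 2 * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def mcnemar_exact(flags_a: Sequence[int], flags_b: Sequence[int]) -> float:
    """Two-sided exact McNemar p-value for paired 0/1 flags."""
    if len(flags_a) != len(flags_b):
        raise ValueError("paired inputs must have equal length")
    # Only discordant pairs carry information about a difference in rates.
    b = sum(x == 1 and y == 0 for x, y in zip(flags_a, flags_b))
    c = sum(x == 0 and y == 1 for x, y in zip(flags_a, flags_b))
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical paired flags for two models on the same 20 questions.
model_a = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 5
model_b = [0] * 8 + [1] * 2 + [1] * 5 + [0] * 5
print(bootstrap_rate_ci(model_a))
print(mcnemar_exact(model_a, model_b))  # 8 vs 2 discordant pairs -> 0.109375
```

McNemar's test requires paired observations, so it fits model-versus-model comparisons on identical questions; comparing languages this way would additionally assume the question sets are aligned across languages.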
Circularity Check
No significant circularity in the reported hallucination rates
The paper's chain consists of generating synthetic hallucination examples via the external LettuceDetect framework on MultiWikiQA, training token-level classifiers on those examples for 30 languages, and then applying the classifiers to measure hallucination rates in outputs from five evaluated models on a subset of languages. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The measured rates are direct applications of the classifiers to model-generated answers and do not reduce to the synthetic training inputs by construction; any mismatch between synthetic and real hallucinations is an external validity concern rather than circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Faithfulness hallucinations are defined as model-generated content that is fluent and plausible but diverges from the provided input or is internally inconsistent.
Reference graph
Works this paper leans on
-
[1]
We adopt the definition of faithfulness hallucinations as proposed by Huang et al
Introduction Large Language Models (LLMs) are prone to generating fluent yet false outputs, which is known as hallucinations. We adopt the definition of faithfulness hallucinations as proposed by Huang et al. (2025): a language model generates fluent and plausible content that diverges from the given input/prompt, or is internally inconsistent. For example, if a model is a...
-
[2]
A multilingual hallucination benchmark: MultiWikiQHalluA
Related Work Hallucinations in language model outputs are commonly categorised into two types: factuality and faithfulness (Huang et al., 2025). Factuality hallucinations involve claims that contradict established world knowledge (e.g. stating that the Eiffel Tower is in London). Faithfulness hallucinations occur when generated text diverges from a provided s...
-
[3]
These approaches primarily test world knowledge and may miss context-grounded errors
probes susceptibility to common misconceptions; HaluEval (Li et al., 2023) benchmarks hallucination detection across QA, summarisation, and dialogue; HalluLens (Bang et al., 2025) provides a broad multi-task evaluation of LLM hallucinations; and SimpleQA (Wei et al., 2024) measures short-form factual accuracy. These approaches primarily test world ...
-
[4]
Methods LettuceDetect (Kovacs and Recski, 2025) is a tool for detecting hallucinations in Retrieval-Augmented Generation (RAG) systems. It generates a hallucination dataset based on the dataset RagTruth (Niu et al., 2024) and then trains a binary token-level classifier on it. This trained model can then be used to detect hallucinations in LLM-generated ...
-
[5]
Discussion Across all models, high-resource languages (English and German) exhibit consistently lower hallucination rates than the lower-resource languages Danish and Icelandic, with Icelandic showing the highest rates. For the high-resource languages, the
Model | Language | Supported-F1 | Unsupported-F1 | Accuracy
Ettin-17m | Danish | 0.8239 | 0.6560 | 0.7670
EuroBERT-21...
-
[6]
Conclusion In this work, we presented a multilingual hallucination benchmark leveraging the LettuceDetect framework and the MultiWikiQA dataset. We released a synthetic hallucination dataset for 306 languages and token-level hallucination classifiers for 30 European languages, and evaluated five language models (Qwen3-0.6B, Qwen3-14B, Gemma-3-12B-IT, c...
-
[7]
Resources All resources are publicly available. Note that the dataset covers 306 languages (the full scope of MultiWikiQA), the classifiers are released for 30 European languages (the subset for which we fine-tuned models), and the evaluations in this paper cover four languages (English, Danish, German, and Icelandic). • Dataset: The synthetic hallucination d...
-
[8]
Acknowledgements This research was funded by the EU Horizon project TrustLLM (grant agreement number 101135671)
-
[9]
Bibliographical References Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, and Pascale Fung. 2025. HalluLens: LLM hallucination benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24128–24156, Vienna, Austria. Association for ...
-
[10]
A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55. Akos Kovacs and Gabor Recski. 2025. LettuceDetect: A hallucination detection framework for RAG applications. Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. HaluEval: A ...
-
[11]
TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 3214–3252. Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. 2023. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. Matt Marone, Oren Weller, Wil...
-
[12]
mmBERT: A modern multilingual encoder with annealed language learning. Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods...
-
[13]
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal
Association for Computational Linguistics. James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 809–819....
-
[14]
Aohan Yang, An Li, Bo Yang, Bingchao Zhang, Bin Hui, Bo Zheng, and Zheng Qiu
Seq vs seq: An open suite of paired encoders and decoders. Aohan Yang, An Li, Bo Yang, Bingchao Zhang, Bin Hui, Bo Zheng, and Zheng Qiu. 2025. Qwen3 technical report. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT.
-
[15]
Language Resource References Cheng Niu and Yuanhao Wu and Juno Zhu and Siliang Xu and Kashun Shum and Randy Zhong and Juntong Song and Tong Zhang. 2024. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models. Dan Saattrup Smart. 2025. MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages.