A multilingual hallucination benchmark: MultiWikiQHalluA

Dan Saattrup Smart; Freja Thoresen

arxiv: 2605.02504 · v1 · submitted 2026-05-04 · 💻 cs.CL


Pith reviewed 2026-05-09 16:05 UTC · model grok-4.3


keywords hallucination · multilingual benchmark · faithfulness · low-resource languages · LLM evaluation · synthetic data · token-level classification · Icelandic


The pith

Smaller LLMs produce more hallucinations than larger ones, especially in lower-resource languages like Icelandic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a multilingual benchmark to measure faithfulness hallucinations, where models generate fluent but incorrect or inconsistent content relative to the input. It leverages the MultiWikiQA dataset to create synthetic hallucination examples across 306 languages and trains token-level classifiers for 30 European languages. Evaluations of five models on English, Danish, German, and Icelandic show the smallest model reaching hallucination rates of 60 percent, with rates rising for lower-resource languages.

Core claim

Using classifiers trained on synthetic faithfulness hallucination data from the MultiWikiQA dataset, the authors find that Qwen3-0.6B produces answers with at least one hallucination in up to 60 percent of cases, peaking in Icelandic, while larger models such as cogito-v1-preview-qwen-32B and cogito-v1-preview-llama-70B achieve lower rates across most languages, and hallucination rates remain consistently higher for lower-resource languages.

What carries the argument

Token-level hallucination classifiers trained on synthetic faithfulness hallucination datasets generated via the LettuceDetect framework from the MultiWikiQA dataset for 306 languages.
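
The headline metric, the share of answers containing at least one hallucination, can be derived from token-level classifier output roughly as in the minimal Python sketch below. It assumes each answer arrives as a list of 0/1 token labels (1 = flagged as hallucinated); the function names and the toy labels are hypothetical and are not taken from the paper or from LettuceDetect.

```python
from typing import Sequence


def answer_has_hallucination(token_labels: Sequence[int]) -> bool:
    """Return True if any token in the answer is flagged (label 1)."""
    return any(label == 1 for label in token_labels)


def hallucination_rate(answers: Sequence[Sequence[int]]) -> float:
    """Share of answers that contain at least one flagged token."""
    if not answers:
        raise ValueError("need at least one answer")
    flagged = sum(answer_has_hallucination(labels) for labels in answers)
    return flagged / len(answers)


# Hypothetical per-token classifier outputs for five answers (1 = flagged).
predictions = [
    [0, 0, 0],
    [0, 1, 1, 0],
    [0, 0],
    [1],
    [0, 0, 0, 0],
]
print(hallucination_rate(predictions))  # 2 of 5 answers are flagged -> 0.4
```

Under this aggregation a single false-positive token is enough to flag a whole answer, so the answer-level rate is sensitive to the classifier's token-level precision.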

If this is right

Larger models exhibit lower hallucination rates than smaller ones across the tested languages.
Hallucination rates increase for lower-resource languages compared with high-resource ones.
The benchmark supports evaluation of hallucination behavior in 30 European languages beyond English.
Models such as cogito-v1-preview-qwen-32B and cogito-v1-preview-llama-70B show the lowest hallucination rates on most languages tested.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Model training should incorporate more data from low-resource languages to address elevated hallucination rates.
Synthetic data methods can extend hallucination detection to languages where direct human labels are limited.
The size-dependent pattern suggests that scaling alone may leave gaps in multilingual reliability.

Load-bearing premise

The synthetic hallucinations created from MultiWikiQA data accurately reflect the distribution and types of faithfulness errors that real models make in non-English languages.

What would settle it

Human annotation of model-generated answers in Icelandic to check whether the classifier-identified hallucination rates and locations match actual factual divergences or internal inconsistencies.
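
One way such a check could be scored is sketched below in Python: precision, recall, and F1 of the classifier's hallucinated-token labels against human labels. The sketch assumes both label sequences are aligned token by token and use 1 for "hallucinated"; the function name and the toy labels are hypothetical, and no such annotation is reported in the paper.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    human: Sequence[int], predicted: Sequence[int]
) -> Tuple[float, float, float]:
    """Score predicted hallucination labels (1) against human labels."""
    if len(human) != len(predicted):
        raise ValueError("label sequences must be aligned")
    pairs = list(zip(human, predicted))
    tp = sum(1 for h, p in pairs if h == 1 and p == 1)  # correctly flagged
    fp = sum(1 for h, p in pairs if h == 0 and p == 1)  # flagged in error
    fn = sum(1 for h, p in pairs if h == 1 and p == 0)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical token labels for one Icelandic answer.
human = [0, 1, 1, 0, 0, 1]
predicted = [0, 1, 0, 0, 1, 1]
print(precision_recall_f1(human, predicted))
```

In the toy example the classifier finds two of the three human-marked tokens and adds one spurious flag, so precision, recall, and F1 all come out at two thirds.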

Figures

Figures reproduced from arXiv: 2605.02504 by Dan Saattrup Smart, Freja Thoresen.

**Figure 1.** Overview of the two-stage methodology: synthetic hallucination data generation pipeline, where MultiWikiQA contexts, questions, and ground-truth answers are passed to the LettuceDetect framework, which uses a language model to produce token-labelled hallucinated answers; and fine-tuning of the mmBERT-small token-level hallucination classifier on the resulting dataset. The grey highlights the two deliver…

Original abstract

Most hallucination evaluations focus on English, leaving it unclear whether findings transfer to lower-resource languages. We investigate faithfulness hallucinations, defined as model-generated content that is fluent and plausible but diverges from the provided input or is internally inconsistent. Leveraging the multilingual MultiWikiQA dataset, we utilize the LettuceDetect framework to create synthetic hallucination datasets for 306 languages, from which we train token-level hallucination classifiers for 30 European languages. In this work, we present evaluations of model hallucinations on a selection of languages: English, Danish, German, and Icelandic. Using these classifiers, we evaluate the hallucination rates for Qwen3-0.6B, Qwen3-14B, Gemma-3-12B-IT, cogito-v1-preview-qwen-32B, and cogito-v1-preview-llama-70B. Our classifiers reveal notably higher hallucination rates for Qwen3-0.6B (up to 60% of answers containing at least one hallucination, peaking in Icelandic) and generally lower rates for larger models, with cogito-v1-preview-qwen-32B and cogito-v1-preview-llama-70B performing best on most languages. Hallucination rates are consistently higher for lower-resource languages, particularly Icelandic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The multilingual scale is new but the reported hallucination rates rest on unvalidated synthetic data.


The paper generates synthetic faithfulness hallucination examples for 306 languages from MultiWikiQA using LettuceDetect, trains token-level classifiers on 30 European languages, and then measures rates for five models on English, Danish, German, and Icelandic. Smaller models show higher rates, with Qwen3-0.6B reaching 60% on some languages and Icelandic standing out as worst; larger models do better overall, and lower-resource languages trend higher.

That language coverage is the clearest step beyond existing English-only benchmarks. The work is straightforward in its setup and directly targets a practical gap in multilingual evaluation.

The central measurements still lack visible support. The abstract and available details give no human validation of the synthetic examples, no classifier accuracy numbers, and no error bars on the percentages. The stress-test concern holds: if the LettuceDetect outputs do not match the actual error patterns real models produce in Icelandic or Danish, the language and model differences could be artifacts rather than findings. No cross-check against genuine model outputs is described.

This is useful for researchers who need a starting point for non-English hallucination work and are willing to treat the numbers as preliminary. It is not yet solid enough for strong claims about scaling or deployment priorities. A serious editor should send it to review so the authors can add validation experiments or at least clarify the limits of the synthetic data.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MultiWikiQHalluA, a multilingual benchmark for faithfulness hallucinations. Using the MultiWikiQA dataset and LettuceDetect framework, it generates synthetic hallucination data for 306 languages, trains token-level classifiers for 30 European languages, and evaluates hallucination rates for five LLMs (Qwen3-0.6B, Qwen3-14B, Gemma-3-12B-IT, cogito-v1-preview-qwen-32B, cogito-v1-preview-llama-70B) on English, Danish, German, and Icelandic. Key claims include up to 60% hallucination rates for Qwen3-0.6B (peaking in Icelandic), lower rates for larger models, and consistently higher rates in lower-resource languages.

Significance. If the synthetic data and classifiers reliably capture real hallucinations, the work offers a scalable method to extend hallucination evaluation beyond English, potentially aiding development of more robust multilingual models and highlighting language-specific reliability gaps. The scale (306 languages for data generation) is a notable strength, though the central measurements depend on unvalidated assumptions about synthetic data fidelity.

major comments (3)

[Methods] Methods (synthetic data generation): The classifiers are trained exclusively on LettuceDetect synthetic hallucinations, yet no quantitative validation (human annotations, inter-annotator agreement, or direct comparison to real model outputs in Danish/German/Icelandic) is reported to confirm that these examples match the distribution and linguistic cues of genuine faithfulness errors; this is load-bearing for all reported rates.
[Results] Results (hallucination rate measurements): The headline percentages (e.g., 60% for Qwen3-0.6B) are presented without error bars, confidence intervals, or statistical significance tests, and no details are given on classifier evaluation (train/test splits, F1/precision/recall on held-out synthetic or real data), leaving the quantitative claims without visible empirical support.
[Evaluation] Evaluation setup: The paper evaluates only four languages despite generating data for 306 and classifiers for 30; no justification or ablation is provided for this selection, nor any cross-lingual transfer analysis to support generalization to Icelandic (the lowest-resource language highlighted).

minor comments (2)

[Abstract] Abstract: The claim of 'notably higher hallucination rates' for Qwen3-0.6B would benefit from explicit comparison numbers for the other models rather than qualitative descriptors.
[Introduction] Notation: 'Faithfulness hallucinations' is defined but the precise operationalization (e.g., how internal inconsistency vs. input divergence is labeled in synthetic data) could be clarified with an example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, agreeing where revisions are needed to improve clarity and rigor, and explaining our approach where we maintain the original design. We will incorporate changes in the revised version.


Referee: [Methods] Methods (synthetic data generation): The classifiers are trained exclusively on LettuceDetect synthetic hallucinations, yet no quantitative validation (human annotations, inter-annotator agreement, or direct comparison to real model outputs in Danish/German/Icelandic) is reported to confirm that these examples match the distribution and linguistic cues of genuine faithfulness errors; this is load-bearing for all reported rates.

Authors: We acknowledge that explicit validation of the synthetic data against real model outputs for non-English languages would strengthen the work. The LettuceDetect framework generates synthetic faithfulness hallucinations via targeted perturbations of reference answers, following established practices in hallucination detection research. No human annotations or inter-annotator agreement studies were performed for Danish, German, or Icelandic in this study. In the revised manuscript we will add a dedicated subsection describing the synthetic generation process in detail, reference any validation results from the original LettuceDetect paper (primarily English), and include an explicit limitations paragraph stating the assumption that synthetic examples approximate real hallucinations. This makes the methodological reliance transparent without claiming unperformed validation. revision: partial
Referee: [Results] Results (hallucination rate measurements): The headline percentages (e.g., 60% for Qwen3-0.6B) are presented without error bars, confidence intervals, or statistical significance tests, and no details are given on classifier evaluation (train/test splits, F1/precision/recall on held-out synthetic or real data), leaving the quantitative claims without visible empirical support.

Authors: We agree that the results would be more robust with additional statistical details. The classifiers were trained using standard 80/20 train/test splits on the synthetic data for each language. In the revision we will report per-language F1, precision, and recall on held-out synthetic test sets. For the reported hallucination rates we will add bootstrap-derived confidence intervals and conduct statistical significance tests (e.g., McNemar tests) comparing rates across models and languages. These additions will provide the requested empirical support for the headline figures. revision: yes
Referee: [Evaluation] Evaluation setup: The paper evaluates only four languages despite generating data for 306 and classifiers for 30; no justification or ablation is provided for this selection, nor any cross-lingual transfer analysis to support generalization to Icelandic (the lowest-resource language highlighted).

Authors: The four languages were selected to span a resource spectrum (English and German as high-resource, Danish as medium, Icelandic as lower-resource) among the 30 European languages for which classifiers were trained. We will add a justification subsection explaining this choice based on data availability and the desire to highlight lower-resource behavior. We will also include an ablation showing classifier performance trends across resource levels for all 30 languages and a brief discussion of implications for cross-lingual generalization to Icelandic. Full evaluation on all 306 languages is outside the current scope, as classifiers were trained only for the 30 languages with sufficient MultiWikiQA coverage. revision: partial
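
The bootstrap confidence intervals and McNemar tests proposed in the second response could look roughly like the following Python sketch. It assumes per-answer binary flags (1 = answer contains at least one flagged token) and, for McNemar, that both models answered the same questions so the flags are paired; the function names, defaults, and toy data are hypothetical and do not come from the paper.

```python
import random
from math import comb
from typing import Sequence, Tuple


def bootstrap_ci(
    flags: Sequence[int], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the share of flagged answers."""
    rng = random.Random(seed)
    n = len(flags)
    # Resample answers with replacement and recompute the rate each time.
    rates = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples))
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def mcnemar_exact(flags_a: Sequence[int], flags_b: Sequence[int]) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes."""
    pairs = list(zip(flags_a, flags_b))
    b = sum(1 for x, y in pairs if x and not y)  # only model A flagged
    c = sum(1 for x, y in pairs if y and not x)  # only model B flagged
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical flags for two models on the same 100 questions.
model_a = [1] * 60 + [0] * 40
model_b = [1] * 35 + [0] * 65
print(bootstrap_ci(model_a))
print(mcnemar_exact(model_a, model_b))
```

McNemar's test only applies when the two sets of flags are paired item by item. Comparing rates across languages would need aligned questions, or an unpaired test instead.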

Circularity Check

0 steps flagged

No significant circularity in the reported hallucination rates


The paper's chain consists of generating synthetic hallucination examples via the external LettuceDetect framework on MultiWikiQA, training token-level classifiers on those examples for 30 languages, and then applying the classifiers to measure hallucination rates in outputs from five evaluated models on a subset of languages. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The measured rates are direct applications of the classifiers to model-generated answers and do not reduce to the synthetic training inputs by construction; any mismatch between synthetic and real hallucinations is an external validity concern rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the assumption that the LettuceDetect framework can be applied uniformly across languages to generate faithful synthetic labels, plus the background definition of faithfulness hallucinations; no free parameters or new entities are mentioned in the abstract.

axioms (1)

Domain assumption: Faithfulness hallucinations are defined as model-generated content that is fluent and plausible but diverges from the provided input or is internally inconsistent.
This definition is used to guide the creation of the synthetic datasets.



Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 1 internal anchor

[1]

We adopt the definition of faithfulness hallucinations as proposed by Huang et al

Introduction: Large Language Models (LLMs) are prone to generating fluent yet false outputs, which is known as hallucinations. We adopt the definition of faithfulness hallucinations as proposed by Huang et al. (2025): a language model generates fluent and plausible content that diverges from the given input/prompt, or is internally inconsistent. For example, if a model is a...

work page 2025
[2]

A multilingual hallucination benchmark: MultiWikiQHalluA

Related Work: Hallucinations in language model outputs are commonly categorised into two types: factuality and faithfulness (Huang et al., 2025). Factuality hallucinations involve claims that contradict established world knowledge (e.g. stating that the Eiffel Tower is in London). Faithfulness hallucinations occur when generated text diverges from a provided s...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

These approaches primarily test world knowledge and may miss context-grounded errors

probes susceptibility to common misconceptions; HaluEval (Li et al., 2023) benchmarks hallucination detection across QA, summarisation, and dialogue; HalluLens (Bang et al., 2025) provides a broad multi-task evaluation of LLM hallucinations; and SimpleQA (Wei et al., 2024) measures short-form factual accuracy. These approaches primarily test world ...

work page 2023
[4]

true" sample and a

Methods: LettuceDetect (Kovacs and Recski, 2025) is a tool for detecting hallucinations in Retrieval-Augmented Generation (RAG) systems. It generates a hallucination dataset based on the dataset RagTruth (Niu et al., 2024) and then trains a binary token-level classifier on it. This trained model can then be used to detect hallucinations in LLM-generated ...

work page 2025
[5]

Discussion: Across all models, high-resource languages (English and German) exhibit consistently lower hallucination rates than the lower-resource languages Danish and Icelandic, with Icelandic showing the highest rates. For the high-resource languages, the ...

Table fragment (columns: Model, Language, Supported-F1, Unsupported-F1, Accuracy): Ettin-17m, Danish, 0.8239, 0.6560, 0.7670; EuroBERT-21...

work page 2021
[6]

Conclusion: In this work, we presented a multilingual hallucination benchmark leveraging the LettuceDetect framework and the MultiWikiQA dataset. We released a synthetic hallucination dataset for 306 languages and token-level hallucination classifiers for 30 European languages, and evaluated five language models (Qwen3-0.6B, Qwen3-14B, Gemma-3-12B-IT, c...

work page
[7]

Resources: All resources are publicly available. Note that the dataset covers 306 languages (the full scope of MultiWikiQA), the classifiers are released for 30 European languages (the subset for which we fine-tuned models), and the evaluations in this paper cover four languages (English, Danish, German, and Icelandic). • Dataset: The synthetic hallucination d...

work page
[8]

Acknowledgements: This research was funded by the EU Horizon project TrustLLM (grant agreement number 101135671)

work page
[9]

Bibliographical References: Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, and Pascale Fung. 2025. HalluLens: LLM hallucination benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24128–24156, Vienna, Austria. Association for ...

work page 2025
[10]

Akos Kovacs and Gabor Recski

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55. Akos Kovacs and Gabor Recski. 2025. Lettucedetect: A hallucination detection framework for rag applications. Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. HaluEval: A ...

work page 2025
[11]

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 3214–3252

TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 3214–3252. Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. Matt Marone, Oren Weller, Wil...

work page 2023
[12]

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi

mmbert: A modern multilingual encoder with annealed language learning. Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods...

work page 2023
[13]

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal

Association for Computational Linguistics. James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 809–819....

work page arXiv 2018
[14]

Aohan Yang, An Li, Bo Yang, Bingchao Zhang, Bin Hui, Bo Zheng, and Zheng Qiu

Seq vs seq: An open suite of paired encoders and decoders. Aohan Yang, An Li, Bo Yang, Bingchao Zhang, Bin Hui, Bo Zheng, and Zheng Qiu. 2025. Qwen3 technical report. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert

work page 2025
[15]

2024. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models

Language Resource References: Cheng Niu and Yuanhao Wu and Juno Zhu and Siliang Xu and Kashun Shum and Randy Zhong and Juntong Song and Tong Zhang. 2024. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models. Dan Saattrup Smart. 2025. MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages

work page 2024
