AfrIFact: Cultural Information Retrieval, Evidence Extraction and Fact Checking for African Languages
Pith reviewed 2026-05-13 22:52 UTC · model grok-4.3
The pith
AfrIFact dataset shows embedding models lack cross-lingual retrieval for fact-checking in ten African languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce AfrIFact to cover information retrieval, evidence extraction, and fact checking for claims in ten African languages and English. Results indicate that even the strongest embedding models lack cross-lingual retrieval abilities, cultural and news documents retrieve more readily than those in healthcare, and LLMs show weak multilingual fact-verification skills that improve substantially with few-shot prompting and task-specific fine-tuning.
What carries the argument
AfrIFact dataset providing benchmarks for retrieval and verification across culture, news, and healthcare domains in low-resource African languages.
Load-bearing premise
The AfrIFact dataset and its focus on culture, news, and healthcare domains adequately capture the real-world retrieval and verification challenges faced in the ten African languages.
What would settle it
Finding that top embedding models achieve strong cross-lingual retrieval performance on the AfrIFact dataset would contradict the reported lack of capabilities.
Original abstract
Assessing the veracity of a claim made online is a complex and important task with real-world implications. When these claims are directed at communities with limited access to information and the content concerns issues such as healthcare and culture, the consequences intensify, especially in low-resource languages. In this work, we introduce AfrIFact, a dataset that covers the necessary steps for automatic fact-checking (i.e., information retrieval, evidence extraction, and fact checking), in ten African languages and English. Our evaluation results show that even the best embedding models lack cross-lingual retrieval capabilities, and that cultural and news documents are easier to retrieve than healthcare-domain documents, both in large corpora and in single documents. We show that LLMs lack robust multilingual fact-verification capabilities in African languages, while few-shot prompting improves performance by up to 43% in AfriqueQwen-14B, and task-specific fine-tuning further improves fact-checking accuracy by up to 26%. These findings, along with our release of the AfrIFact dataset, encourage work on low-resource information retrieval, evidence retrieval, and fact checking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AfrIFact, a new dataset covering information retrieval, evidence extraction, and fact-checking tasks for ten African languages plus English. It reports that state-of-the-art embedding models lack cross-lingual retrieval ability, that cultural and news documents are easier to retrieve than healthcare documents, and that LLMs show weak multilingual fact-verification performance which improves by up to 43% with few-shot prompting (AfriqueQwen-14B) and up to 26% with task-specific fine-tuning.
Significance. If the dataset construction and evaluations hold, the work provides a valuable benchmark for low-resource African-language fact-checking and highlights concrete model limitations in cross-lingual retrieval and verification. The public release of AfrIFact could stimulate targeted research in these languages, where current systems are weakest.
major comments (4)
- [Dataset Construction] Dataset section: the manuscript provides no statistics on total claims, documents per language, or per-domain splits, nor any inter-annotator agreement figures or annotation guidelines; these details are required to evaluate whether the reported retrieval and fact-checking numbers are reliable.
- [Evaluation] Experimental results: the headline gains (43% few-shot, 26% fine-tuning) are stated without naming the precise baseline models, prompting templates, or full set of comparison systems, so the magnitude and robustness of the improvements cannot be verified from the given information.
- [Information Retrieval Experiments] Retrieval evaluation: the claim that even the best embedding models lack cross-lingual capabilities is not supported by tabulated per-language or per-domain metrics (e.g., Recall@K or MRR) that would allow direct comparison of monolingual versus cross-lingual performance.
- [Dataset Construction] Dataset representativeness: no external validation is described that compares the collected claims against independently gathered social-media or dialectal misinformation corpora in the ten languages; without this, the domain difficulty ordering and model-gap conclusions risk being artifacts of the particular claim selection process.
minor comments (1)
- [Abstract] The abstract states specific percentage improvements without cross-references to the tables or figures that contain the underlying numbers.
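The retrieval metrics the referee asks for (Recall@K and MRR) could be computed per language and per domain along the following lines. This is a minimal sketch, not the authors' evaluation code: the function names, the `runs` mapping (query ID to ranked document IDs), and the `qrels` mapping (query ID to relevant document IDs) are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    runs: Dict[str, List[str]],   # hypothetical: query ID -> ranked doc IDs
    qrels: Dict[str, Set[str]],   # hypothetical: query ID -> relevant doc IDs
    k: int,
) -> Dict[str, float]:
    """Average Recall@k and MRR over queries that have relevance judgments."""
    query_ids = [q for q in runs if q in qrels]
    if not query_ids:
        return {"recall@k": 0.0, "mrr": 0.0}
    n = len(query_ids)
    return {
        "recall@k": sum(recall_at_k(runs[q], qrels[q], k) for q in query_ids) / n,
        "mrr": sum(reciprocal_rank(runs[q], qrels[q]) for q in query_ids) / n,
    }
```

Running `evaluate` once on a monolingual run and once on a cross-lingual run for the same queries would give the side-by-side comparison the comment requests.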
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. We address each major comment below and will revise the manuscript to improve clarity, completeness, and reproducibility of the dataset and experiments.
-
Referee: [Dataset Construction] Dataset section: the manuscript provides no statistics on total claims, documents per language, or per-domain splits, nor any inter-annotator agreement figures or annotation guidelines; these details are required to evaluate whether the reported retrieval and fact-checking numbers are reliable.
Authors: We agree these details are necessary for assessing reliability. The revised version will include a new table with total claims, documents per language, and per-domain splits. We will also report inter-annotator agreement (Cohen's kappa) and include the full annotation guidelines as an appendix. revision: yes
-
Referee: [Evaluation] Experimental results: the headline gains (43% few-shot, 26% fine-tuning) are stated without naming the precise baseline models, prompting templates, or full set of comparison systems, so the magnitude and robustness of the improvements cannot be verified from the given information.
Authors: We will expand the experimental section to explicitly name all baseline models (including zero-shot and few-shot variants of mT5, BLOOMZ, Llama-2, and AfriqueQwen), provide the exact prompting templates in the appendix, and include a comprehensive results table with all comparison systems and their scores. revision: yes
-
Referee: [Information Retrieval Experiments] Retrieval evaluation: the claim that even the best embedding models lack cross-lingual capabilities is not supported by tabulated per-language or per-domain metrics (e.g., Recall@K or MRR) that would allow direct comparison of monolingual versus cross-lingual performance.
Authors: We will add detailed per-language and per-domain tables reporting Recall@K and MRR for both monolingual and cross-lingual retrieval settings. These tables will directly support the claim by showing the performance gaps across languages and domains. revision: yes
-
Referee: [Dataset Construction] Dataset representativeness: no external validation is described that compares the collected claims against independently gathered social-media or dialectal misinformation corpora in the ten languages; without this, the domain difficulty ordering and model-gap conclusions risk being artifacts of the particular claim selection process.
Authors: We acknowledge that a full external validation against independent corpora is not feasible given the scarcity of such resources for all ten languages. In the revision we will add a limitations paragraph describing the claim sourcing process from verified social media and news sources in the target languages and discuss how this affects generalizability of the domain difficulty findings. revision: partial
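The inter-annotator agreement statistic promised in the first response (Cohen's kappa) can be illustrated for two annotators as follows. This is a sketch under stated assumptions: the function name and the example label strings are illustrative, and returning 1.0 when chance agreement is already total is a convention chosen here, not something the source specifies.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout (convention: 1.0).
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only; the real annotation data is not part of this page.
annotator_1 = ["SUPPORTS", "SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]
annotator_2 = ["SUPPORTS", "REFUTES", "REFUTES", "NOT ENOUGH INFO"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

With more than two annotators per item, a different statistic (for example Fleiss' kappa) would be needed; the source mentions only Cohen's kappa.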
Circularity Check
No circularity: purely empirical dataset construction and benchmarking
The paper introduces the AfrIFact dataset covering retrieval, evidence extraction, and fact-checking across ten African languages plus English, then reports direct empirical results on embedding models and LLMs (e.g., cross-lingual retrieval gaps, domain difficulty differences, and accuracy gains from few-shot prompting or fine-tuning). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; all claims are measured outcomes on the constructed data rather than reductions to inputs by construction. The representativeness concern raised by the referee is a validity issue, not a circularity issue.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The selected ten African languages and three domains capture representative challenges for fact-checking in low-resource settings.
Reference graph
Works this paper leans on
-
[1]
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2732–2757, Albuquerque, New Mexico. Association for Computational Linguistics. ...
-
[2]
Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages. arXiv preprint arXiv:2508.14913. Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Con...
-
[3]
Detecting check-worthy factual claims in presidential debates. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, and 1 others. 2022. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3...
-
[4]
Monolingual Corpus on Health: Queries in each language are searched only against documents in the same language within the health domain, e.g., Amharic queries are searched only against Amharic documents.
-
[5]
Multilingual Corpus on Health: Queries in each language are searched against documents in all 11 languages. As documents are parallel across languages, each query now has 11 copies of its relevant documents, all of which contribute to the evaluation of retrieval results. All documents are still from the health domain.
-
[6]
Multilingual Corpus on Health and Culture-News: [in main table] Queries in each language are additionally searched against documents in all languages from the culture-news domains. Evaluation stays the same as in Multilingual Corpus on Health.
-
[7]
Multilingual Corpus on Health and Culture-News (eval on monolingual ground-truth): [in main table] Identical corpus as above, but relevant documents in languages other than the query language are excluded from both the retrieval results and the ground truth during evaluation. In the above settings, the retrieval corpus gradually moved from monolingual mono-domain to multilingual multi-domain,...
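The last setting above (evaluation on monolingual ground truth) amounts to a filtering step applied before scoring. The sketch below shows one way it might look; the function name, the `doc_lang` mapping from document ID to language code, and the document ID format are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Set, Tuple


def filter_to_query_language(
    ranked: List[str],
    relevant: Set[str],
    doc_lang: Dict[str, str],  # hypothetical: doc ID -> language code
    query_lang: str,
) -> Tuple[List[str], Set[str]]:
    """Drop relevant documents written in a language other than the query's.

    Non-relevant documents stay in the ranking regardless of language, so
    they can still push the same-language relevant document further down.
    """
    kept_ranked = [
        doc_id
        for doc_id in ranked
        if doc_id not in relevant or doc_lang[doc_id] == query_lang
    ]
    kept_relevant = {d for d in relevant if doc_lang[d] == query_lang}
    return kept_ranked, kept_relevant


# Illustrative data: one health document available in three languages
# ("lang_x" is a placeholder code), plus an unrelated culture-news document.
doc_lang = {"eng:h1": "eng", "amh:h1": "amh", "lang_x:h1": "lang_x", "eng:c4": "eng"}
ranked = ["eng:h1", "eng:c4", "amh:h1", "lang_x:h1"]
relevant = {"eng:h1", "amh:h1", "lang_x:h1"}

kept_ranked, kept_relevant = filter_to_query_language(ranked, relevant, doc_lang, "amh")
```

In this example the Amharic copy ends up at rank 2 after filtering, behind the non-relevant English culture-news document, which is the kind of cross-lingual interference this setting is meant to isolate.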
-
[8]
Read the claim and candidate evidence sentences
-
[9]
Determine if evidence supports or refutes the claim
-
[10]
Combine multiple sentences if needed
-
[11]
If no sufficient evidence exists, label NOT ENOUGH INFO.
Rule of Thumb: If only the selected evidence is given, can the claim be verified as true or false? If not, label NOT ENOUGH INFO.
Edge Case Rules
• Avoid ambiguous claims (e.g., many, several, popular).
• Distinguish between actors and fictional characters.
• Filmographies and lists are not exhausti...