AfrIFact: Cultural Information Retrieval, Evidence Extraction and Fact Checking for African Languages

Israel Abebe Azime, Jesujoba Oluwadara Alabi, Crystina Zhang, Iffat Maab, Atnafu Lambebo Tonja, Tadesse Destaw Belay, Folasade Peace Alabi, Salomey Osei, Saminu Mohammad Aliyu, Nkechinyere Faith Aguobi, Bontu Fufa Balcha, Blessing Kudzaishe Sibanda, Davis David, Mouhamadane Mboup, Daud Abolade, Neo Putini, Philipp Slusallek, David Ifeoluwa Adelani, Dietrich Klakow

arxiv: 2604.00706 · v2 · submitted 2026-04-01 · 💻 cs.CL


Pith reviewed 2026-05-13 22:52 UTC · model grok-4.3

classification 💻 cs.CL

keywords African languages, fact checking, information retrieval, evidence extraction, low-resource NLP, multilingual models, dataset benchmark


The pith

AfrIFact dataset shows embedding models lack cross-lingual retrieval for fact-checking in ten African languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AfrIFact, a dataset designed to support the full pipeline of automatic fact-checking in ten African languages and English. Evaluations demonstrate that leading embedding models cannot effectively handle cross-lingual retrieval tasks. Cultural and news content proves easier to retrieve than healthcare material, whether in large collections or single documents. Large language models also struggle with multilingual fact verification, although few-shot prompting raises performance by as much as 43 percent and targeted fine-tuning adds up to 26 percent more accuracy.

Core claim

We introduce AfrIFact to cover information retrieval, evidence extraction, and fact checking for claims in ten African languages and English. Results indicate that even the strongest embedding models lack cross-lingual retrieval abilities, cultural and news documents retrieve more readily than those in healthcare, and LLMs show weak multilingual fact-verification skills that improve substantially with few-shot prompting and task-specific fine-tuning.

What carries the argument

AfrIFact dataset providing benchmarks for retrieval and verification across culture, news, and healthcare domains in low-resource African languages.

Load-bearing premise

The AfrIFact dataset and its focus on culture, news, and healthcare domains adequately capture the real-world retrieval and verification challenges faced in the ten African languages.

What would settle it

Finding that top embedding models achieve strong cross-lingual retrieval performance on the AfrIFact dataset would contradict the reported lack of capabilities.

Figures

Figures reproduced from arXiv: 2604.00706 by Atnafu Lambebo Tonja, Blessing Kudzaishe Sibanda, Bontu Fufa Balcha, Crystina Zhang, Daud Abolade, David Ifeoluwa Adelani, Davis David, Dietrich Klakow, Folasade Peace Alabi, Iffat Maab, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Mouhamadane Mboup, Neo Putini, Nkechinyere Faith Aguobi, Philipp Slusallek, Salomey Osei, Saminu Mohammad Aliyu, Tadesse Destaw Belay.

**Figure 2.** Illustration of the data construction process

**Figure 3.** nDCG@10 scores on the Health domain when retrieving from different corpora and evaluated on relevant …

**Figure 4.** Average accuracy scores on African languages of different language models on the AfrIFact fact-checking …

**Figure 5.** Why is evidence not helping improve accuracy? Evidence introduces a conservative shift in model predictions: in Health, it reduces hallucinated SUPPORTS but increases NOT_ENOUGH_INFORMATION predictions, while in Culture, it significantly improves NEI detection by reducing false SUPPORTS classifications. In the Health domain NOT_ENOUGH_INFORMATION and SUPPORTS improve while in the culture domain, the model …

**Figure 6.** Distribution of COMET scores for claim and document translations across African languages. The upper grid shows histograms of COMET scores for claim translations across ten languages, while the lower grid presents score distributions for document translations in five languages (Igbo, Oromo, Shona, Twi, and Wolof). The figure also gives fixed or false-alarm percentages; items with COMET scores below 0.6 are colored red and a…

**Figure 7.** Distribution of COMET scores for document translations across African languages. The figure also gives fixed or false-alarm percentages; items with COMET scores below 0.6 are colored red and were addressed by annotators. [Panels: Doc Translation COMET - Igbo, Doc Translation COMET - Oromo, …; x-axis: COMET Score; y-axis: Frequency]

**Figure 8.** Example interface of the customized tool used …

**Figure 9.** Example of the customized annotation in …
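Figure 3 reports nDCG@10. As a reminder of what that metric rewards, here is a minimal sketch, assuming binary relevance labels and the usual log2 rank discount. The function name `ndcg_at_k` and the toy document IDs are hypothetical; the paper's own evaluation code is not shown on this page and may differ.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """nDCG@k with binary relevance: gain 1 for a relevant document, 0 otherwise."""
    # Discounted gain of the relevant documents found in the top k positions.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ranking: all relevant documents (at most k) placed first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (hypothetical IDs): the single relevant document is ranked second.
score = ndcg_at_k(["doc_x", "doc_a", "doc_y"], {"doc_a"}, k=10)
print(f"nDCG@10 = {score:.4f}")
```

Because the ideal ranking depends on how many relevant documents exist, adding parallel translations of a relevant document to the ground truth changes the denominator, so scores from monolingual and multilingual ground truths are not directly comparable.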

Original abstract

Assessing the veracity of a claim made online is a complex and important task with real-world implications. When these claims are directed at communities with limited access to information and the content concerns issues such as healthcare and culture, the consequences intensify, especially in low-resource languages. In this work, we introduce AfrIFact, a dataset that covers the necessary steps for automatic fact-checking (i.e., information retrieval, evidence extraction, and fact checking), in ten African languages and English. Our evaluation results show that even the best embedding models lack cross-lingual retrieval capabilities, and that cultural and news documents are easier to retrieve than healthcare-domain documents, both in large corpora and in single documents. We show that LLMs lack robust multilingual fact-verification capabilities in African languages, while few-shot prompting improves performance by up to 43% in AfriqueQwen-14B, and task-specific fine-tuning further improves fact-checking accuracy by up to 26%. These findings, along with our release of the AfrIFact dataset, encourage work on low-resource information retrieval, evidence retrieval, and fact checking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AfrIFact supplies a new dataset for the full fact-checking pipeline in ten African languages, but the reported model gaps rest on claims whose real-world coverage still needs external checks.


The main takeaway is that this paper releases AfrIFact, a dataset that walks through information retrieval, evidence extraction, and fact verification for ten African languages plus English. That resource fills a documented gap and gives people concrete material to work with in low-resource settings. The evaluations are straightforward: the best embedding models show weak cross-lingual retrieval, healthcare documents prove harder to retrieve than cultural or news ones, and LLMs improve with few-shot prompting (up to 43% on AfriqueQwen-14B) and task-specific fine-tuning (up to 26%). Releasing the data is the clearest contribution here, since it lets others test and extend the pipeline without starting from scratch. The domain split also surfaces a practical point: healthcare material is tougher, which matters given the stakes mentioned in the abstract.

The soft spot is how representative the dataset actually is. The claims about model shortcomings depend on the selected examples reflecting typical misinformation patterns, yet the work does not appear to include external validation against independent social-media collections or dialectal variants. Without those checks, the size of the reported gaps could partly reflect construction choices rather than general language properties. Dataset sizes, annotation agreement figures, and full baseline tables would help readers judge the numbers more firmly.

This paper belongs in a reading group on multilingual IR and fact-checking, especially one focused on Africa. It deserves peer review because the resource itself is new and the identified weaknesses point to concrete next steps, even if the experiments will need tightening on validation.

Referee Report

4 major / 1 minor

Summary. The paper introduces AfrIFact, a new dataset covering information retrieval, evidence extraction, and fact-checking tasks for ten African languages plus English. It reports that state-of-the-art embedding models lack cross-lingual retrieval ability, that cultural and news documents are easier to retrieve than healthcare documents, and that LLMs show weak multilingual fact-verification performance which improves by up to 43% with few-shot prompting (AfriqueQwen-14B) and up to 26% with task-specific fine-tuning.

Significance. If the dataset construction and evaluations hold, the work provides a valuable benchmark for low-resource African-language fact-checking and highlights concrete model limitations in cross-lingual retrieval and verification. The public release of AfrIFact could stimulate targeted research in these languages, where current systems are weakest.

major comments (4)

[Dataset Construction] Dataset section: the manuscript provides no statistics on total claims, documents per language, or per-domain splits, nor any inter-annotator agreement figures or annotation guidelines; these details are required to evaluate whether the reported retrieval and fact-checking numbers are reliable.
[Evaluation] Experimental results: the headline gains (43% few-shot, 26% fine-tuning) are stated without naming the precise baseline models, prompting templates, or full set of comparison systems, so the magnitude and robustness of the improvements cannot be verified from the given information.
[Information Retrieval Experiments] Retrieval evaluation: the claim that even the best embedding models lack cross-lingual capabilities is not supported by tabulated per-language or per-domain metrics (e.g., Recall@K or MRR) that would allow direct comparison of monolingual versus cross-lingual performance.
[Dataset Construction] Dataset representativeness: no external validation is described that compares the collected claims against independently gathered social-media or dialectal misinformation corpora in the ten languages; without this, the domain difficulty ordering and model-gap conclusions risk being artifacts of the particular claim selection process.
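The retrieval comment above asks for per-language Recall@K or MRR in both monolingual and cross-lingual settings. A minimal sketch of how such a comparison could be computed follows. All names (`recall_at_k`, `mean_reciprocal_rank`, the query and document IDs) are hypothetical, the toy rankings are invented purely for illustration, and nothing here reproduces the paper's numbers.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_reciprocal_rank(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """Average of 1/rank of the first relevant document; 0 when none is retrieved."""
    if not qrels:
        return 0.0
    total = 0.0
    for query_id, relevant_ids in qrels.items():
        # A query missing from the run counts as a miss (empty ranking).
        for rank, doc_id in enumerate(runs.get(query_id, []), start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(qrels)


# Invented toy data: the same two queries scored against a same-language corpus
# and against a corpus in another language.
qrels = {"q1": {"d1"}, "q2": {"d7"}}
monolingual_run = {"q1": ["d1", "d2", "d3"], "q2": ["d7", "d8", "d9"]}
cross_lingual_run = {"q1": ["d2", "d1", "d3"], "q2": ["d8", "d9", "d4"]}

for name, run in [("monolingual", monolingual_run), ("cross-lingual", cross_lingual_run)]:
    mrr = mean_reciprocal_rank(run, qrels)
    recall = sum(recall_at_k(run[q], qrels[q], k=3) for q in qrels) / len(qrels)
    print(f"{name}: MRR={mrr:.2f} Recall@3={recall:.2f}")
```

Reporting both settings side by side, per language and per domain, is what would let a reader see the cross-lingual gap directly instead of inferring it from an aggregate score.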

minor comments (1)

[Abstract] The abstract states specific percentage improvements without cross-references to the tables or figures that contain the underlying numbers.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment below and will revise the manuscript to improve clarity, completeness, and reproducibility of the dataset and experiments.


Referee: [Dataset Construction] Dataset section: the manuscript provides no statistics on total claims, documents per language, or per-domain splits, nor any inter-annotator agreement figures or annotation guidelines; these details are required to evaluate whether the reported retrieval and fact-checking numbers are reliable.

Authors: We agree these details are necessary for assessing reliability. The revised version will include a new table with total claims, documents per language, and per-domain splits. We will also report inter-annotator agreement (Cohen's kappa) and include the full annotation guidelines as an appendix.
revision: yes

Referee: [Evaluation] Experimental results: the headline gains (43% few-shot, 26% fine-tuning) are stated without naming the precise baseline models, prompting templates, or full set of comparison systems, so the magnitude and robustness of the improvements cannot be verified from the given information.

Authors: We will expand the experimental section to explicitly name all baseline models (including zero-shot and few-shot variants of mT5, BLOOMZ, Llama-2, and AfriqueQwen), provide the exact prompting templates in the appendix, and include a comprehensive results table with all comparison systems and their scores.
revision: yes

Referee: [Information Retrieval Experiments] Retrieval evaluation: the claim that even the best embedding models lack cross-lingual capabilities is not supported by tabulated per-language or per-domain metrics (e.g., Recall@K or MRR) that would allow direct comparison of monolingual versus cross-lingual performance.

Authors: We will add detailed per-language and per-domain tables reporting Recall@K and MRR for both monolingual and cross-lingual retrieval settings. These tables will directly support the claim by showing the performance gaps across languages and domains.
revision: yes

Referee: [Dataset Construction] Dataset representativeness: no external validation is described that compares the collected claims against independently gathered social-media or dialectal misinformation corpora in the ten languages; without this, the domain difficulty ordering and model-gap conclusions risk being artifacts of the particular claim selection process.

Authors: We acknowledge that a full external validation against independent corpora is not feasible given the scarcity of such resources for all ten languages. In the revision we will add a limitations paragraph describing the claim sourcing process from verified social media and news sources in the target languages and discuss how this affects generalizability of the domain difficulty findings.
revision: partial
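The first response promises inter-annotator agreement as Cohen's kappa. For readers unfamiliar with the statistic, here is a minimal sketch for two annotators, written from the textbook definition (observed agreement corrected for chance agreement). The function name `cohens_kappa`, the label strings, and the toy annotations are assumptions for illustration, not data from AfrIFact.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations over six claims (label names are assumed).
annotator_1 = ["SUPPORTS", "REFUTES", "NOT_ENOUGH_INFORMATION", "SUPPORTS", "REFUTES", "SUPPORTS"]
annotator_2 = ["SUPPORTS", "REFUTES", "SUPPORTS", "SUPPORTS", "REFUTES", "NOT_ENOUGH_INFORMATION"]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.3f}")
```

Kappa is defined for exactly two annotators; if more than two people labelled each claim, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual choice.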

Circularity Check

0 steps flagged

No circularity: purely empirical dataset construction and benchmarking


The paper introduces the AfrIFact dataset covering retrieval, evidence extraction, and fact-checking across ten African languages plus English, then reports direct empirical results on embedding models and LLMs (e.g., cross-lingual retrieval gaps, domain difficulty differences, and accuracy gains from few-shot prompting or fine-tuning). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; all claims are measured outcomes on the constructed data rather than reductions to inputs by construction. The representativeness concern raised by the skeptic is a validity issue, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No mathematical derivations or free parameters; the work rests on empirical data collection and standard model evaluations.

axioms (1)

domain assumption: The selected ten African languages and three domains capture representative challenges for fact-checking in low-resource settings.
Invoked in dataset construction and evaluation design.



