"Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation

Boxing Chen; David Alfonso-Hermelo; Ehsan Kamalloo; Jimmy Lin; Luiz Bonifacio; Mehdi Rezagholizadeh; Nandan Thakur; Odunayo Ogundepo; Qun Liu; Xiaoguang Li

arxiv: 2312.11361 · v3 · pith:QCZ3XI7Lnew · submitted 2023-12-18 · 💻 cs.CL · cs.IR

Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi Rezagholizadeh, Jimmy Lin

keywords: subset, relevant, non-relevant, nomiracl, rate, dataset, model, passages

Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. However, prior work lacks a comprehensive evaluation of different language families, making it challenging to evaluate LLM robustness against errors in external retrieved knowledge. To overcome this, we establish NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. NoMIRACL includes both a non-relevant and a relevant subset. Queries in the non-relevant subset contain passages judged as non-relevant, whereas queries in the relevant subset include at least a single judged relevant passage. We measure relevance assessment using: (i) hallucination rate, measuring model tendency to hallucinate when the answer is not present in passages in the non-relevant subset, and (ii) error rate, measuring model inaccuracy in recognizing relevant passages in the relevant subset. In our work, we observe that most models struggle to balance the two capacities. Models such as LLAMA-2 and Orca-2 achieve over 88% hallucination rate on the non-relevant subset. Mistral and LLAMA-3 hallucinate less but can achieve up to a 74.9% error rate on the relevant subset. Overall, GPT-4 is observed to provide the best tradeoff on both subsets, highlighting future work necessary to improve LLM robustness. NoMIRACL dataset and evaluation code are available at: https://github.com/project-miracl/nomiracl.
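
The two metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes each model response has already been reduced to a binary decision (the model either claims the passages contain the answer or abstains), and it reads the abstract's definitions as simple per-subset fractions. The `Judgment` class, the subset labels, and both function names are hypothetical and are not taken from the NoMIRACL evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    # Hypothetical record: which subset the query belongs to, and whether
    # the model claimed that the retrieved passages contain the answer.
    subset: str              # "relevant" or "non_relevant" (assumed labels)
    says_answerable: bool


def hallucination_rate(judgments: List[Judgment]) -> float:
    """Fraction of non-relevant-subset queries where the model still
    claims an answer is present (assumed reading of the abstract)."""
    non_rel = [j for j in judgments if j.subset == "non_relevant"]
    if not non_rel:
        raise ValueError("no queries in the non-relevant subset")
    return sum(j.says_answerable for j in non_rel) / len(non_rel)


def error_rate(judgments: List[Judgment]) -> float:
    """Fraction of relevant-subset queries where the model fails to
    recognize that a relevant passage is present (assumed reading)."""
    rel = [j for j in judgments if j.subset == "relevant"]
    if not rel:
        raise ValueError("no queries in the relevant subset")
    return sum(not j.says_answerable for j in rel) / len(rel)


# Toy data, invented for illustration only.
toy = [
    Judgment("non_relevant", True),
    Judgment("non_relevant", True),
    Judgment("non_relevant", True),
    Judgment("non_relevant", False),
    Judgment("relevant", True),
    Judgment("relevant", True),
    Judgment("relevant", True),
    Judgment("relevant", False),
]

print(hallucination_rate(toy))  # 0.75
print(error_rate(toy))          # 0.25
```

Under this reading, a model that always answers has a hallucination rate of 1.0 and an error rate of 0.0, while a model that always abstains has the reverse, which is why the abstract frames the two metrics as a tradeoff.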

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Retrieval-Augmented Generation Must Move Beyond Factual Grounding to Represent Diverse Opinions
cs.AI 2026-04 unverdicted novelty 5.0

Opinion-aware RAG with LLM opinion extraction and entity-linked graphs improves retrieval diversity by 26-42% over factual baselines on e-commerce forum data.

Retrieval-Augmented Generation for Large Language Models: A Survey
cs.CL 2023-12 unverdicted novelty 3.0

A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.