Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
Pith reviewed 2026-05-10 03:58 UTC · model grok-4.3
The pith
Providing the full set of retrieved contexts improves LLM judges' accuracy in evaluating multi-hop RAG retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CARE consistently outperforms existing LLM-based evaluation methods for multi-hop reasoning in RAG systems by evaluating the collective support provided by all retrieved contexts, with performance gains most pronounced in models with larger parameter counts and longer context windows.
What carries the argument
Context-Aware Retriever Evaluation (CARE): an LLM-as-judge strategy that presents the complete retrieved context set to determine if the passages together support the answer to a multi-hop query.
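To make the strategy concrete, here is a minimal sketch of what a context-aware judge call could look like, assuming an OpenAI-style chat-completions client; the prompt wording and the `judge_collective_support` helper are illustrative placeholders, not the paper's exact prompt or code.

```python
# Minimal illustrative sketch: one judge call over the FULL retrieved context set.
# Prompt wording and function name are assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_collective_support(question: str, answer: str, contexts: list[str],
                             model: str = "gpt-4o") -> bool:
    """Ask whether all retrieved contexts, taken together, support the answer."""
    joined = "\n\n".join(f"[Context {i + 1}]\n{c}" for i, c in enumerate(contexts))
    prompt = (
        "You are evaluating a retriever for a multi-hop question.\n"
        f"Question: {question}\n"
        f"Reference answer: {answer}\n\n"
        f"Retrieved contexts:\n{joined}\n\n"
        "Considering the contexts collectively (each may be useful only in combination "
        "with the others), do they provide enough information to support the answer? "
        "Reply with 'yes' or 'no'."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```

By contrast, single-context baselines would call the judge once per passage and aggregate per-passage verdicts, which is exactly where passages that only matter in combination can be missed.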
If this is right
- CARE yields larger accuracy improvements when applied to bigger LLMs with extended context lengths.
- Minimal differences appear between methods when evaluating single-hop queries.
- The results underscore the importance of context awareness for reliable assessment of RAG retrievers in complex scenarios.
- The method works across LLMs from different providers such as OpenAI, Meta, and Google.
Where Pith is reading between the lines
- RAG systems could adopt similar collective context assessment during the retrieval phase itself to improve document selection for multi-hop queries.
- The evaluation approach may apply to other domains involving multi-step information synthesis beyond the tested datasets.
- Additional human studies could validate the LLM judgments to increase confidence in CARE's assessments.
Load-bearing premise
LLM judges can be trusted to determine whether a group of contexts collectively supports a multi-hop answer without further human validation of the judgments.
What would settle it
A study in which human annotators independently judge the same retrieval sets for multi-hop support, with their agreement against CARE and against the baseline methods then compared, would falsify the claim if CARE showed no improvement in alignment.
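A minimal sketch of that comparison, assuming binary support labels and scikit-learn's `cohen_kappa_score`; the label arrays are hypothetical stand-ins for collected annotations.

```python
# Hypothetical settling experiment: compare human agreement with CARE vs. a baseline
# judge on the same retrieval sets. Labels: 1 = contexts collectively support the answer.
from sklearn.metrics import cohen_kappa_score

human_labels    = [1, 0, 1, 1, 0, 1, 0, 1]   # independent human annotations (placeholder)
care_labels     = [1, 0, 1, 1, 0, 1, 1, 1]   # CARE verdicts on the same instances (placeholder)
baseline_labels = [0, 0, 1, 0, 0, 1, 1, 1]   # single-context baseline verdicts (placeholder)

kappa_care = cohen_kappa_score(human_labels, care_labels)
kappa_base = cohen_kappa_score(human_labels, baseline_labels)
print(f"human-CARE kappa: {kappa_care:.2f}  human-baseline kappa: {kappa_base:.2f}")
# The claim would be falsified if kappa_care showed no improvement over kappa_base.
```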
Original abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined. In this research, we use the HotPotQA, MuSiQue, and SQuAD datasets to simulate a RAG system and compare three LLM-as-judge evaluation strategies, including our proposed Context-Aware Retriever Evaluation (CARE). Our goal is to better understand how multi-hop reasoning can be most effectively evaluated in RAG systems. Experiments with LLMs from OpenAI, Meta, and Google demonstrate that CARE consistently outperforms existing methods for evaluating multi-hop reasoning in RAG systems. The performance gains are most pronounced in models with larger parameter counts and longer context windows, while single-hop queries show minimal sensitivity to context-aware evaluation. Overall, the results highlight the critical role of context-aware evaluation in improving the reliability and accuracy of retrieval-augmented generation systems, particularly in complex query scenarios. To ensure reproducibility, we provide the complete data of our experiments at https://github.com/lorenzbrehme/CARE.
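As an illustration of the simulation the abstract describes, the sketch below builds the evaluated context set for a single HotPotQA-style record from its gold supporting facts; field names follow the public HotPotQA format, and the record is abbreviated, so treat this as a sketch of the setup rather than the authors' pipeline.

```python
# Sketch of the simulated retrieval step: collect the gold supporting passages for one
# HotPotQA-style record. Field names follow the public HotPotQA format; this is an
# illustration of the setup described in the abstract, not the paper's code.

def build_context_set(record: dict) -> tuple[str, str, list[str]]:
    """Return (question, answer, contexts), where contexts are the gold supporting passages."""
    supporting_titles = {title for title, _sent_id in record["supporting_facts"]}
    contexts = [
        " ".join(sentences)
        for title, sentences in record["context"]
        if title in supporting_titles
    ]
    return record["question"], record["answer"], contexts

example = {
    "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
    "answer": "Arthur's Magazine",
    "context": [
        ["Arthur's Magazine", ["Arthur's Magazine (1844-1846) was an American literary periodical."]],
        ["First for Women", ["First for Women is a woman's magazine published by Bauer Media Group.",
                             "The magazine was started in 1989."]],
        ["Radio City", ["Radio City is India's first private FM radio station."]],
    ],
    "supporting_facts": [["Arthur's Magazine", 0], ["First for Women", 1]],
}

question, answer, contexts = build_context_set(example)
# Neither supporting passage answers the comparison on its own; only the pair does,
# which is the situation a context-aware judge is meant to handle.
```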
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Context-Aware Retriever Evaluation (CARE), an LLM-as-judge strategy for assessing multi-hop reasoning in RAG systems. It compares CARE against existing methods on HotPotQA, MuSiQue, and SQuAD by simulating RAG retrieval and using LLMs from OpenAI, Meta, and Google as judges. The central claim is that CARE consistently outperforms baselines, with larger gains for models having more parameters and longer context windows; single-hop queries show little difference. The manuscript provides a GitHub link for full experimental data.
Significance. If validated, the work would usefully highlight limitations of single-context evaluation for multi-hop RAG and offer a practical alternative. The multi-model experiments and public data release are strengths that support reproducibility. However, the significance is currently limited by untested assumptions about judge reliability and simulation fidelity.
Major comments (2)
- [Experiments and Results] The central claim that CARE 'consistently outperforms' existing methods for multi-hop evaluation rests on LLM judges scoring whether retrieved contexts collectively entail the answer. No human agreement rates (e.g., Cohen's kappa), error analysis on multi-hop cases, or validation of judge reliability are reported anywhere in the experimental results or methodology. This is load-bearing because the performance comparison cannot be trusted without evidence that the judges themselves are accurate on collective support.
- [Methodology / Dataset Simulation] The RAG simulation injects gold supporting facts from the datasets rather than outputs from an actual retriever (BM25, dense, etc.). Consequently, the evaluation never encounters the partial, irrelevant, or noisy contexts that real RAG systems produce. This directly undermines generalization of the 'outperforms' result to practical retriever evaluation, as stated in the abstract and introduction.
Minor comments (1)
- [Abstract and Introduction] The abstract and introduction refer to 'three LLM-as-judge evaluation strategies' but do not explicitly name the two baselines; a clear enumeration in §3 or §4 would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of validation and experimental design that we address below with planned revisions.
Point-by-point responses
Referee: [Experiments and Results] The central claim that CARE 'consistently outperforms' existing methods for multi-hop evaluation rests on LLM judges scoring whether retrieved contexts collectively entail the answer. No human agreement rates (e.g., Cohen's kappa), error analysis on multi-hop cases, or validation of judge reliability are reported anywhere in the experimental results or methodology. This is load-bearing because the performance comparison cannot be trusted without evidence that the judges themselves are accurate on collective support.
Authors: We agree that the absence of human validation for the LLM judges limits the strength of our claims regarding absolute reliability. While our experiments compare methods under identical judge conditions, allowing relative differences to be observed, we recognize the need for direct evidence of judge accuracy on collective entailment. In the revised manuscript, we will add a human evaluation study: a random sample of multi-hop instances from each dataset will be annotated by multiple human raters to determine whether the provided contexts collectively support the answer. We will report Cohen's kappa for inter-human agreement and human-LLM agreement, along with a qualitative error analysis of disagreements, particularly on multi-hop cases. This will be included in a new subsection of the experiments. revision: yes
Referee: [Methodology / Dataset Simulation] The RAG simulation injects gold supporting facts from the datasets rather than outputs from an actual retriever (BM25, dense, etc.). Consequently, the evaluation never encounters the partial, irrelevant, or noisy contexts that real RAG systems produce. This directly undermines generalization of the 'outperforms' result to practical retriever evaluation, as stated in the abstract and introduction.
Authors: The simulation intentionally uses gold supporting facts to isolate the impact of context-aware judgment on multi-hop reasoning without introducing retrieval noise as a confounding factor. This design choice enables a controlled comparison of how judges handle distributed information across contexts. We acknowledge that this does not replicate the partial or irrelevant contexts typical of real retrievers, which restricts direct claims about performance in deployed RAG systems. In the revision, we will update the abstract, introduction, and methodology to explicitly describe the simulation as an idealized setting for evaluating multi-hop judgment strategies. We will also add a dedicated limitations paragraph discussing the gap to real retrievers and outlining future work that applies CARE to outputs from BM25 and dense retrievers on the same datasets. revision: partial
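As a sketch of the extension the authors describe, the snippet below swaps gold-fact injection for top-k lexical retrieval; the `rank_bm25` package and the passage corpus are assumptions chosen for illustration, not part of the paper's reported setup.

```python
# Sketch of the planned extension: feed the judge passages from a real retriever
# (here BM25 via the rank_bm25 package, chosen only for illustration) instead of
# gold supporting facts, so partial and irrelevant contexts actually occur.
from rank_bm25 import BM25Okapi

def retrieve_top_k(query: str, corpus: list[str], k: int = 5) -> list[str]:
    """Return the top-k passages for a query under BM25 scoring."""
    tokenized_corpus = [passage.lower().split() for passage in corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25.get_top_n(query.lower().split(), corpus, n=k)

# The retrieved set would then go to the same judge, e.g.
# judge_collective_support(question, answer, retrieve_top_k(question, corpus)),
# allowing CARE and the baselines to be compared under realistic retrieval noise.
```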
Circularity Check
No circularity: empirical comparison of LLM judges on public datasets
Full rationale
The paper conducts an empirical study comparing three LLM-as-judge strategies (including the proposed CARE) for multi-hop RAG evaluation on HotPotQA, MuSiQue, and SQuAD. It simulates retrieval by injecting dataset-provided supporting facts and measures performance via LLM scoring of collective entailment. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation chain; the central claim (CARE outperforms baselines) is a direct experimental outcome on fixed public data rather than a reduction to its own inputs by construction. The work is grounded in external, independently constructed benchmarks rather than in constructs of its own making.