Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
Pith reviewed 2026-05-10 03:58 UTC · model grok-4.3
The pith
Providing the full set of retrieved contexts improves LLM judges' accuracy in evaluating multi-hop RAG retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CARE consistently outperforms existing LLM-based evaluation methods for multi-hop reasoning in RAG systems by evaluating the collective support provided by all retrieved contexts, with performance gains most pronounced in models with larger parameter counts and longer context windows.
What carries the argument
Context-Aware Retriever Evaluation (CARE): an LLM-as-judge strategy that presents the complete retrieved context set to determine if the passages together support the answer to a multi-hop query.
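To make the strategy concrete, here is a minimal sketch of what a context-aware judge call could look like, assuming an OpenAI-style chat-completions client; the prompt wording and the `judge_collective_support` helper are illustrative placeholders, not the paper's exact prompt or code.

```python
# Minimal illustrative sketch: one judge call over the FULL retrieved context set.
# Prompt wording and function name are assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_collective_support(question: str, answer: str, contexts: list[str],
                             model: str = "gpt-4o") -> bool:
    """Ask whether all retrieved contexts, taken together, support the answer."""
    joined = "\n\n".join(f"[Context {i + 1}]\n{c}" for i, c in enumerate(contexts))
    prompt = (
        "You are evaluating a retriever for a multi-hop question.\n"
        f"Question: {question}\n"
        f"Reference answer: {answer}\n\n"
        f"Retrieved contexts:\n{joined}\n\n"
        "Considering the contexts collectively (each may be useful only in combination "
        "with the others), do they provide enough information to support the answer? "
        "Reply with 'yes' or 'no'."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```

By contrast, single-context baselines would call the judge once per passage and aggregate per-passage verdicts, which is exactly where passages that only matter in combination can be missed.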
If this is right
- CARE yields larger accuracy improvements when applied to bigger LLMs with extended context lengths.
- Minimal differences appear between methods when evaluating single-hop queries.
- The results underscore the importance of context awareness for reliable assessment of RAG retrievers in complex scenarios.
- The method works across LLMs from different providers such as OpenAI, Meta, and Google.
Where Pith is reading between the lines
- RAG systems could adopt similar collective context assessment during the retrieval phase itself to improve document selection for multi-hop queries.
- The evaluation approach may apply to other domains involving multi-step information synthesis beyond the tested datasets.
- Additional human studies could validate the LLM judgments to increase confidence in CARE's assessments.
Load-bearing premise
LLM judges can be trusted to determine whether a group of contexts collectively supports a multi-hop answer without further human validation of the judgments.
What would settle it
A study in which human annotators independently judge the same retrieval sets for multi-hop support, with their agreement against CARE and against the baseline methods then compared, would falsify the claim if CARE showed no improvement in alignment.
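A minimal sketch of that comparison, assuming binary support labels and scikit-learn's `cohen_kappa_score`; the label arrays are hypothetical stand-ins for collected annotations.

```python
# Hypothetical settling experiment: compare human agreement with CARE vs. a baseline
# judge on the same retrieval sets. Labels: 1 = contexts collectively support the answer.
from sklearn.metrics import cohen_kappa_score

human_labels    = [1, 0, 1, 1, 0, 1, 0, 1]   # independent human annotations (placeholder)
care_labels     = [1, 0, 1, 1, 0, 1, 1, 1]   # CARE verdicts on the same instances (placeholder)
baseline_labels = [0, 0, 1, 0, 0, 1, 1, 1]   # single-context baseline verdicts (placeholder)

kappa_care = cohen_kappa_score(human_labels, care_labels)
kappa_base = cohen_kappa_score(human_labels, baseline_labels)
print(f"human-CARE kappa: {kappa_care:.2f}  human-baseline kappa: {kappa_base:.2f}")
# The claim would be falsified if kappa_care showed no improvement over kappa_base.
```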
Original abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined. In this research, we use the HotPotQA, MuSiQue, and SQuAD datasets to simulate a RAG system and compare three LLM-as-judge evaluation strategies, including our proposed Context-Aware Retriever Evaluation (CARE). Our goal is to better understand how multi-hop reasoning can be most effectively evaluated in RAG systems. Experiments with LLMs from OpenAI, Meta, and Google demonstrate that CARE consistently outperforms existing methods for evaluating multi-hop reasoning in RAG systems. The performance gains are most pronounced in models with larger parameter counts and longer context windows, while single-hop queries show minimal sensitivity to context-aware evaluation. Overall, the results highlight the critical role of context-aware evaluation in improving the reliability and accuracy of retrieval-augmented generation systems, particularly in complex query scenarios. To ensure reproducibility, we provide the complete data of our experiments at https://github.com/lorenzbrehme/CARE.
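As an illustration of the simulation the abstract describes, the sketch below builds the evaluated context set for a single HotPotQA-style record from its gold supporting facts; field names follow the public HotPotQA format, and the record is abbreviated, so treat this as a sketch of the setup rather than the authors' pipeline.

```python
# Sketch of the simulated retrieval step: collect the gold supporting passages for one
# HotPotQA-style record. Field names follow the public HotPotQA format; this is an
# illustration of the setup described in the abstract, not the paper's code.

def build_context_set(record: dict) -> tuple[str, str, list[str]]:
    """Return (question, answer, contexts), where contexts are the gold supporting passages."""
    supporting_titles = {title for title, _sent_id in record["supporting_facts"]}
    contexts = [
        " ".join(sentences)
        for title, sentences in record["context"]
        if title in supporting_titles
    ]
    return record["question"], record["answer"], contexts

example = {
    "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
    "answer": "Arthur's Magazine",
    "context": [
        ["Arthur's Magazine", ["Arthur's Magazine (1844-1846) was an American literary periodical."]],
        ["First for Women", ["First for Women is a woman's magazine published by Bauer Media Group.",
                             "The magazine was started in 1989."]],
        ["Radio City", ["Radio City is India's first private FM radio station."]],
    ],
    "supporting_facts": [["Arthur's Magazine", 0], ["First for Women", 1]],
}

question, answer, contexts = build_context_set(example)
# Neither supporting passage answers the comparison on its own; only the pair does,
# which is the situation a context-aware judge is meant to handle.
```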
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Context-Aware Retriever Evaluation (CARE), an LLM-as-judge strategy for assessing multi-hop reasoning in RAG systems. It compares CARE against existing methods on HotPotQA, MuSiQue, and SQuAD by simulating RAG retrieval and using LLMs from OpenAI, Meta, and Google as judges. The central claim is that CARE consistently outperforms baselines, with larger gains for models having more parameters and longer context windows; single-hop queries show little difference. The manuscript provides a GitHub link for full experimental data.
Significance. If validated, the work would usefully highlight limitations of single-context evaluation for multi-hop RAG and offer a practical alternative. The multi-model experiments and public data release are strengths that support reproducibility. However, the significance is currently limited by untested assumptions about judge reliability and simulation fidelity.
Major comments (2)
- [Experiments and Results] The central claim that CARE 'consistently outperforms' existing methods for multi-hop evaluation rests on LLM judges scoring whether retrieved contexts collectively entail the answer. No human agreement rates (e.g., Cohen's kappa), error analysis on multi-hop cases, or validation of judge reliability are reported anywhere in the experimental results or methodology. This is load-bearing because the performance comparison cannot be trusted without evidence that the judges themselves are accurate on collective support.
- [Methodology / Dataset Simulation] The RAG simulation injects gold supporting facts from the datasets rather than outputs from an actual retriever (BM25, dense, etc.). Consequently, the evaluation never encounters the partial, irrelevant, or noisy contexts that real RAG systems produce. This directly undermines generalization of the 'outperforms' result to practical retriever evaluation, as stated in the abstract and introduction.
Minor comments (1)
- [Abstract and Introduction] The abstract and introduction refer to 'three LLM-as-judge evaluation strategies' but do not explicitly name the two baselines; a clear enumeration in §3 or §4 would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of validation and experimental design that we address below with planned revisions.
Point-by-point responses
Referee: [Experiments and Results] The central claim that CARE 'consistently outperforms' existing methods for multi-hop evaluation rests on LLM judges scoring whether retrieved contexts collectively entail the answer. No human agreement rates (e.g., Cohen's kappa), error analysis on multi-hop cases, or validation of judge reliability are reported anywhere in the experimental results or methodology. This is load-bearing because the performance comparison cannot be trusted without evidence that the judges themselves are accurate on collective support.
Authors: We agree that the absence of human validation for the LLM judges limits the strength of our claims regarding absolute reliability. While our experiments compare methods under identical judge conditions, allowing relative differences to be observed, we recognize the need for direct evidence of judge accuracy on collective entailment. In the revised manuscript, we will add a human evaluation study: a random sample of multi-hop instances from each dataset will be annotated by multiple human raters to determine whether the provided contexts collectively support the answer. We will report Cohen's kappa for inter-human agreement and human-LLM agreement, along with a qualitative error analysis of disagreements, particularly on multi-hop cases. This will be included in a new subsection of the experiments. revision: yes
Referee: [Methodology / Dataset Simulation] The RAG simulation injects gold supporting facts from the datasets rather than outputs from an actual retriever (BM25, dense, etc.). Consequently, the evaluation never encounters the partial, irrelevant, or noisy contexts that real RAG systems produce. This directly undermines generalization of the 'outperforms' result to practical retriever evaluation, as stated in the abstract and introduction.
Authors: The simulation intentionally uses gold supporting facts to isolate the impact of context-aware judgment on multi-hop reasoning without introducing retrieval noise as a confounding factor. This design choice enables a controlled comparison of how judges handle distributed information across contexts. We acknowledge that this does not replicate the partial or irrelevant contexts typical of real retrievers, which restricts direct claims about performance in deployed RAG systems. In the revision, we will update the abstract, introduction, and methodology to explicitly describe the simulation as an idealized setting for evaluating multi-hop judgment strategies. We will also add a dedicated limitations paragraph discussing the gap to real retrievers and outlining future work that applies CARE to outputs from BM25 and dense retrievers on the same datasets. revision: partial
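As a sketch of the extension the authors describe, the snippet below swaps gold-fact injection for top-k lexical retrieval; the `rank_bm25` package and the passage corpus are assumptions chosen for illustration, not part of the paper's reported setup.

```python
# Sketch of the planned extension: feed the judge passages from a real retriever
# (here BM25 via the rank_bm25 package, chosen only for illustration) instead of
# gold supporting facts, so partial and irrelevant contexts actually occur.
from rank_bm25 import BM25Okapi

def retrieve_top_k(query: str, corpus: list[str], k: int = 5) -> list[str]:
    """Return the top-k passages for a query under BM25 scoring."""
    tokenized_corpus = [passage.lower().split() for passage in corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25.get_top_n(query.lower().split(), corpus, n=k)

# The retrieved set would then go to the same judge, e.g.
# judge_collective_support(question, answer, retrieve_top_k(question, corpus)),
# allowing CARE and the baselines to be compared under realistic retrieval noise.
```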
Circularity Check
No circularity: empirical comparison of LLM judges on public datasets
Full rationale
The paper conducts an empirical study comparing three LLM-as-judge strategies (including the proposed CARE) for multi-hop RAG evaluation on HotPotQA, MuSiQue, and SQuAD. It simulates retrieval by injecting dataset-provided supporting facts and measures performance via LLM scoring of collective entailment. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation chain; the central claim (CARE outperforms baselines) is a direct experimental outcome on fixed public data rather than a reduction to its own inputs by construction. The work is grounded in external, independently constructed benchmarks rather than in constructs of its own making.