pith. machine review for the scientific record.

arxiv: 2603.27253 · v2 · submitted 2026-03-28 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Mitigating Hallucination on Hallucination in RAG via Ensemble Voting

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords RAG · hallucination mitigation · ensemble voting · retrieval-augmented generation · LLM reliability · voting mechanism · parallel agents

The pith

VOTE-RAG reduces compounded RAG hallucinations by voting across multiple retrieval queries and independent answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-augmented generation can produce worse errors when flawed retrieved documents steer the model into further mistakes. VOTE-RAG counters this with a training-free two-stage process: multiple agents first issue diverse queries in parallel and pool the returned documents, then multiple agents generate answers from that pool and the system outputs the majority choice. Experiments across six benchmark datasets show the method performs at least as well as more elaborate frameworks while remaining simpler and fully parallelizable. The design also sidesteps problem drift because it never iteratively alters the original query.
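
As a concrete illustration, here is a minimal sketch of that two-stage loop in Python. The llm and retrieve callables, the prompts, and the agent counts are placeholders rather than the paper's implementation, and nonzero sampling temperature is assumed so that paraphrases and answers actually differ.

    from collections import Counter

    def vote_rag(question, llm, retrieve, n_query_agents=4, n_answer_agents=5):
        """Sketch of a two-stage ensemble: pooled retrieval, then majority vote."""
        # Stage 1 (retrieval voting): agents issue diverse paraphrases in parallel;
        # the original question is supplemented, never replaced.
        paraphrases = [llm(f"Rewrite this question: {question}")
                       for _ in range(n_query_agents)]
        pool = {doc for q in [question, *paraphrases] for doc in retrieve(q)}

        # Stage 2 (response voting): independent answers over the shared pool,
        # resolved by simple majority.
        context = "\n".join(sorted(pool))
        answers = [llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
                   for _ in range(n_answer_agents)]
        return Counter(answers).most_common(1)[0][0]

Every model call within a stage is independent of the others, which is what makes the scheme fully parallelizable; and because the original question enters both stages unmodified, there is no query to drift.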

Core claim

VOTE-RAG is a two-stage ensemble voting framework that first aggregates documents through parallel retrieval voting with diverse queries and then resolves answers through response voting by majority among independently generated outputs, achieving performance comparable to or surpassing more complex frameworks on six benchmark datasets.

What carries the argument

Two-stage voting mechanism: retrieval voting pools documents from multiple parallel diverse queries, followed by response voting that selects the majority answer from independent generations based on the pooled documents.

If this is right

  • Performance matches or exceeds that of more complex RAG frameworks on six standard benchmarks.
  • The architecture stays simpler and remains fully parallelizable, avoiding sequential refinement steps.
  • Problem drift risk disappears because the original query is never altered during the process.
  • No training or fine-tuning is required, allowing direct use with existing models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same voting pattern could be applied to reduce other LLM consistency failures outside retrieval settings.
  • Increasing the number of parallel agents may improve accuracy further provided compute budgets allow.
  • The method can wrap existing RAG pipelines with only minor changes to query and answer generation calls.

Load-bearing premise

Majority voting among independently generated responses will reliably select the correct answer when the retrieved documents contain misleading content that could prompt consistent hallucinations.
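
The premise is essentially a Condorcet jury assumption, and a few lines of arithmetic (illustrative numbers, not the paper's) show both why it can work and exactly where it breaks: majority voting amplifies accuracy only while each agent is independently correct more often than not.

    from math import comb

    def majority_accuracy(p, n=5):
        """P(majority of n independent agents is correct), each correct w.p. p."""
        need = n // 2 + 1
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

    print(majority_accuracy(0.7))  # ~0.837: voting amplifies above-chance agents
    print(majority_accuracy(0.4))  # ~0.317: below 0.5, voting amplifies the error

Shared misleading passages attack exactly the independence assumption: if every agent reads the same poisoned context, errors correlate, the effective per-agent accuracy can drop below one half, and the vote then entrenches the mistake rather than suppressing it.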

What would settle it

A controlled run on any of the six benchmarks in which a majority of agents produce the same incorrect answer while a minority produces the correct one, causing the final vote to select the error.

Figures

Figures reproduced from arXiv: 2603.27253 by Zequn Xie, Zhengyang Sun.

Figure 1. An overview of our VOTE-RAG framework. It leverages parallel ensemble voting to enhance retrieval breadth and generation robustness against [PITH_FULL_IMAGE:figures/full_fig_p004_1.png]
Original abstract

Retrieval-Augmented Generation (RAG) aims to reduce hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, RAG introduces a critical challenge: "hallucination on hallucination," where flawed retrieval results mislead the generation model, leading to compounded hallucinations. To address this issue, we propose VOTE-RAG, a novel, training-free framework with a two-stage structure and efficient, parallelizable voting mechanisms. VOTE-RAG includes: (1) Retrieval Voting, where multiple agents generate diverse queries in parallel and aggregate all retrieved documents; (2) Response Voting, where multiple agents independently generate answers based on the aggregated documents, with the final output determined by majority vote. We conduct comparative experiments on six benchmark datasets. Our results show that VOTE-RAG achieves performance comparable to or surpassing more complex frameworks. Additionally, VOTE-RAG features a simpler architecture, is fully parallelizable, and avoids the "problem drift" risk. Our work demonstrates that simple, reliable ensemble voting is a superior and more efficient method for mitigating RAG hallucinations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes VOTE-RAG, a training-free two-stage ensemble framework to mitigate 'hallucination on hallucination' in RAG. Stage 1 (Retrieval Voting) generates diverse queries in parallel and aggregates retrieved documents; Stage 2 (Response Voting) has multiple agents independently generate answers from the aggregated context and selects the majority-vote output. The central claim is that this achieves performance comparable to or better than more complex methods on six benchmarks while remaining simpler, fully parallelizable, and free of problem-drift risk.

Significance. If the empirical claims hold with proper controls and statistics, the work would demonstrate that lightweight, training-free majority voting can reliably outperform or match elaborate RAG variants, offering a practical, scalable baseline for hallucination mitigation that emphasizes simplicity and reproducibility.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: comparative results on six benchmarks are asserted without any reported metrics, baselines, error bars, statistical significance tests, or per-dataset tables, so the performance claim cannot be evaluated and is load-bearing for the entire contribution.
  2. [Section 3.2] Response Voting description (Section 3.2): the mechanism assumes the correct answer remains the mode even when retrieval contains misleading documents, yet no agreement rates, tie-resolution procedure, number of agents, or error-case analysis (e.g., instances where consistent hallucinations outvote the truth) are provided; this directly tests the core 'hallucination on hallucination' mitigation hypothesis.
  3. [Section 3 / Experiments] Method and Experiments: no specification of how query diversity is generated, how many agents are used, or how the aggregated document set is truncated, all of which affect both the parallelizability claim and the reproducibility of the reported gains.
minor comments (1)
  1. [Abstract] The phrase 'problem drift' risk is placed in quotes in the abstract but never defined or contrasted with the proposed method.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important gaps in the presentation of our results and implementation details. We will revise the manuscript to address each point, adding the necessary tables, specifications, and analyses to strengthen the empirical support and reproducibility of VOTE-RAG.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: comparative results on six benchmarks are asserted without any reported metrics, baselines, error bars, statistical significance tests, or per-dataset tables, so the performance claim cannot be evaluated and is load-bearing for the entire contribution.

    Authors: We agree that the experimental claims require fuller documentation to be evaluable. In the revised manuscript we will insert a main results table (and appendix tables) reporting exact metrics for VOTE-RAG and every baseline on each of the six datasets, include standard-error bars from multiple runs, and add paired statistical significance tests (e.g., McNemar or t-tests) against the strongest baselines. revision: yes

  2. Referee: [Section 3.2] Response Voting description (Section 3.2): the mechanism assumes the correct answer remains the mode even when retrieval contains misleading documents, yet no agreement rates, tie-resolution procedure, number of agents, or error-case analysis (e.g., instances where consistent hallucinations outvote the truth) are provided; this directly tests the core 'hallucination on hallucination' mitigation hypothesis.

    Authors: We will expand Section 3.2 with the missing specifications: number of response agents (5), tie-resolution rule (highest average token-level confidence, else random among tied answers), and per-instance agreement rates. We will also add a dedicated error-analysis subsection that quantifies cases in which consistent hallucinations outvote the correct answer, thereby directly testing the core hypothesis. (A sketch of this voting rule follows the list.) revision: yes

  3. Referee: [Section 3 / Experiments] Method and Experiments: no specification of how query diversity is generated, how many agents are used, or how the aggregated document set is truncated, all of which affect both the parallelizability claim and the reproducibility of the reported gains.

    Authors: We accept that these implementation details are required for reproducibility. The revision will specify: (i) query diversity generation via prompt paraphrasing and temperature sampling, (ii) exact agent counts (4 for retrieval voting, 5 for response voting), and (iii) truncation of the aggregated document pool to the top-10 unique passages after deduplication. These additions will also clarify the parallel execution schedule. (A sketch of the pooling step follows the list.) revision: yes
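
To make the proposed specification concrete, here is a sketch of the response-voting rule as point 2 describes it. The five-agent count, the confidence tie-break, and the random fallback come from this simulated rebuttal rather than the paper's verified text, and using mean token-level log-probability as the confidence signal is an assumption.

    import random
    from collections import Counter, defaultdict

    def response_vote(candidates, rng=random.Random(0)):
        """candidates: (answer, avg_token_logprob) pairs from independent agents,
        e.g. five of them. Majority vote; ties broken by mean confidence,
        then uniformly at random among any remaining ties."""
        counts = Counter(answer for answer, _ in candidates)
        top = max(counts.values())
        tied = [a for a, c in counts.items() if c == top]
        if len(tied) == 1:
            return tied[0]
        # Tie-break 1: mean token-level confidence per tied answer.
        confs = defaultdict(list)
        for answer, logprob in candidates:
            if answer in tied:
                confs[answer].append(logprob)
        mean = {a: sum(confs[a]) / len(confs[a]) for a in tied}
        best = max(mean.values())
        # Tie-break 2: uniform random among answers still tied on confidence.
        return rng.choice([a for a in tied if mean[a] == best])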
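And a sketch of the pooling step from point 3, under the stated numbers (four retrieval agents, top-10 unique passages after deduplication). Ranking deduplicated passages by their best position in any result list is one plausible truncation policy; the rebuttal does not pin down the ordering.

    def pool_documents(question, retrieve, paraphrase, n_agents=4, keep=10):
        """Aggregate passages retrieved for the original question plus n_agents
        diverse paraphrases, deduplicate, and keep the top-`keep` unique
        passages, each ranked by its best position in any result list."""
        queries = [question] + [paraphrase(question) for _ in range(n_agents)]
        best_rank = {}
        for q in queries:
            for rank, passage in enumerate(retrieve(q)):
                best_rank[passage] = min(rank, best_rank.get(passage, rank))
        return sorted(best_rank, key=best_rank.get)[:keep]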

Circularity Check

0 steps flagged

No circularity: VOTE-RAG is a direct procedural ensemble method with no derivation chain

full rationale

The paper describes VOTE-RAG as a training-free, two-stage procedural framework consisting of parallel query generation for retrieval aggregation followed by independent response generation and majority voting. No equations, fitted parameters, ansatzes, or first-principles derivations are present. Performance claims rest on empirical benchmark comparisons rather than any 'prediction' that reduces to the method's own inputs by construction. No self-citations, uniqueness theorems, or load-bearing references to prior author work are invoked to justify core steps. The central mechanism (majority vote over LLM outputs) is presented as an explicit algorithmic choice, not derived from or equivalent to its own outputs. This is a standard non-circular empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that independent generations will produce a detectable majority for correct answers even under noisy retrieval; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Majority vote among independent generations selects the non-hallucinated answer when retrieval is flawed.
    Invoked in the description of the Response Voting stage; no empirical validation or proof is supplied in the abstract.

pith-pipeline@v0.9.0 · 5482 in / 1217 out tokens · 34668 ms · 2026-05-14T22:41:13.510030+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agentic Retrieval-Augmented Generation for Financial Document Question Answering

cs.AI · 2026-05 · unverdicted · novelty 6.0

    FinAgent-RAG achieves 76.81-78.46% execution accuracy on financial QA benchmarks by combining contrastive retrieval, program-of-thought code generation, and adaptive strategy routing, outperforming baselines by 5.62-9...

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023. [Online]. Available: https://arxiv.org/abs/2303.08774

  2. [2]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023. [Online]. Available: https://arxiv.org/abs/2307.09288

  3. [3]

    Towards transparent AI: A survey on explainable large language models

    A. Palikhe, Z. Yu, Z. Wang, and W. Zhang, “Towards transparent AI: A survey on explainable large language models,” arXiv preprint arXiv:2506.21812, 2025.

  4. [4]

    Survey of hallucination in natural language generation

    Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023. [Online]. Available: https://dl.acm.org/doi/10.1145/3571730

  5. [5]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin et al., “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” ACM Transactions on Information Systems, 2024. [Online]. Available: https://dl.acm.org/doi/10.1145/3703155

  6. [6]

    Dinov3-powered multi-task foundation model for quantitative remote sensing estimation

    Z. Yu, M. Y. I. Idris, P. Wang, and R. Qureshi, “Dinov3-powered multi-task foundation model for quantitative remote sensing estimation,” AAAI 2026, vol. 40, no. 48, pp. 41455–41456, 2026.

  7. [7]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997, 2023. [Online]. Available: https://arxiv.org/abs/2312.10997

  8. [8]

    Removal of hallucination on hallucination: Debate-augmented RAG

    W. Hu, W. Zhang, Y. Jiang, C. J. Zhang, X. Wei, and L. Qing, “Removal of hallucination on hallucination: Debate-augmented RAG,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 15839–15853. [Online]. Available: ht...

  9. [9]

    Reasoning in computer vision: Taxonomy, models, tasks, and methodologies

    A. Sarkar, M. Y. I. Idris, and Z. Yu, “Reasoning in computer vision: Taxonomy, models, tasks, and methodologies,” arXiv preprint arXiv:2508.10523, 2025.

  10. [10]

    Chat-driven text generation and interaction for person retrieval

    Z. Xie, C. Wang, Y. Wang, S. Cai, S. Wang, and T. Jin, “Chat-driven text generation and interaction for person retrieval,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 5259–5270.

  11. [11]

    Conquer: Context-aware representation with query enhancement for text-based person search

    Z. Xie, “Conquer: Context-aware representation with query enhancement for text-based person search,” arXiv preprint arXiv:2601.18625, 2026.

  12. [12]

    Yielding unblemished aesthetics through a unified network for visual imperfections removal in generated images

    Z. Yu and C. S. Chan, “Yielding unblemished aesthetics through a unified network for visual imperfections removal in generated images,” AAAI 2025, vol. 39, no. 9, pp. 9716–9724, 2025.

  13. [13]

    Qrs-trs: Style transfer-based image-to-image translation for carbon stock estimation in quantitative remote sensing

    Z. Yu, J. Wang, H. Chen, and M. Y. I. Idris, “Qrs-trs: Style transfer-based image-to-image translation for carbon stock estimation in quantitative remote sensing,” IEEE Access, 2025.

  14. [14]

    Hvd: Human vision-driven video representation learning for text-video retrieval

    Z. Xie, X. Liu, B. Zhang, Y. Lin, S. Cai, and T. Jin, “Hvd: Human vision-driven video representation learning for text-video retrieval,” arXiv preprint arXiv:2601.16155, 2026.

  15. [15]

    Delving deeper: Hierarchical visual perception for robust video-text retrieval

    Z. Xie, B. Zhang, Y. Lin, and T. Jin, “Delving deeper: Hierarchical visual perception for robust video-text retrieval,” arXiv preprint arXiv:2601.12768, 2026.

  16. [16]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal, “Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 10 0...

  17. [17]

    Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

    Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and W. Chen, “Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 9248–9274. [Onli...

  18. [18]

    Retrieval-generation synergy augmented large language models

    Z. Feng, X. Feng, D. Zhao, M. Yang, and B. Qin, “Retrieval-generation synergy augmented large language models,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11661–11665. [Online]. Available: https://ieeexplore.ieee.org/document/10448015

  19. [19]

    Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-RAG: Learning to retrieve, generate, and critique through self-reflection,” arXiv preprint arXiv:2310.11511, 2023. [Online]. Available: https://arxiv.org/abs/2310.11511

  20. [20]

    Forgetme: Benchmarking the selective forgetting capabilities of generative models

    Z. Yu, M. Y. I. Idris, P. Wang, Y. Xia, and Y. Xiang, “Forgetme: Benchmarking the selective forgetting capabilities of generative models,” EAAI, vol. 161, p. 112087, 2025.

  21. [21]

    Debate or vote: Which yields better decisions in multi-agent large language models?

    H. K. Choi, X. Zhu, and S. Li, “Debate or vote: Which yields better decisions in multi-agent large language models?” in Advances in Neural Information Processing Systems, 2025.

  22. [22]

    Spatiotemporal alignment for remote sensing image recovery via terrain-aware diffusion

    Z. Yu, H. Jiang, P. Wang, Z. Lin, and Y. Xiang, “Spatiotemporal alignment for remote sensing image recovery via terrain-aware diffusion,” ICASSP 2026, 2026.

  23. [23]

    Cotextor: Training-free modular multilingual text editing via layered disentanglement and depth-aware fusion

    Z. Yu, M. Y. I. Idris, P. Wang, and R. Qureshi, “Cotextor: Training-free modular multilingual text editing via layered disentanglement and depth-aware fusion,” in NeurIPS 2025, 2025.

  24. [24]

    Retrieval augmented language model pre-training

    K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang, “Retrieval augmented language model pre-training,” in International Conference on Machine Learning. PMLR, 2020, pp. 3929–3938. [Online]. Available: https://dl.acm.org/doi/abs/10.5555/3524938.3525306

  25. [25]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020. [Online]. Available: https://arxiv.org/abs/2005.11401

  26. [26]

    Leveraging passage retrieval with generative models for open domain question answering

    G. Izacard and E. Grave, “Leveraging passage retrieval with generative models for open domain question answering,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, P. Merlo, J. Tiedemann, and R. Tsarfaty, Eds. Online: Association for Computational Linguistics, Apr. 2021, pp. 874–8...

  27. [27]

    Few-shot learning with retrieval augmented language models

    G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, “Few-shot learning with retrieval augmented language models,” arXiv preprint arXiv:2208.03299, 2022. [Online]. Available: https://arxiv.org/abs/2208.03299

  28. [28]

    REPLUG: Retrieval-augmented black-box language models

    W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W.-t. Yih, “REPLUG: Retrieval-augmented black-box language models,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), K. Duh, H. Gomez, and S. Bethard, Eds. Me...

  29. [29]

    Fusing semantics, observability, reliability and diversity of concept detectors for video search

    X.-Y. Wei and C.-W. Ngo, “Fusing semantics, observability, reliability and diversity of concept detectors for video search,” in Proceedings of the 16th ACM International Conference on Multimedia, 2008, pp. 81–90. [Online]. Available: https://dl.acm.org/doi/10.1145/1459359.1459371

  30. [30]

    Multi-agent large language models for conversational task-solving

    J. Becker, “Multi-agent large language models for conversational task-solving,” arXiv preprint arXiv:2410.22932, 2024. [Online]. Available: https://arxiv.org/abs/2410.22932

  31. [31]

    Finecir: Explicit parsing of fine-grained modification semantics for composed image retrieval

    Z. Li, Z. Fu, Y. Hu, Z. Chen, H. Wen, and L. Nie, “Finecir: Explicit parsing of fine-grained modification semantics for composed image retrieval,” https://arxiv.org/abs/2503.21309, 2025.

  32. [32]

    Encoder: Entity mining and modification relation binding for composed image retrieval

    Z. Li, Z. Chen, H. Wen, Z. Fu, Y. Hu, and W. Guan, “Encoder: Entity mining and modification relation binding for composed image retrieval,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 5, 2025, pp. 5101–5109.

  33. [33]

    Hud: Hierarchical uncertainty-aware disambiguation network for composed video retrieval

    Z. Chen, Y. Hu, Z. Li, Z. Fu, H. Wen, and W. Guan, “Hud: Hierarchical uncertainty-aware disambiguation network for composed video retrieval,” in Proceedings of the ACM International Conference on Multimedia, 2025, pp. 6143–6152.

  34. [34]

    Intent: Invariance and discrimination-aware noise mitigation for robust composed image retrieval

    Z. Chen, Y. Hu, Z. Fu, Z. Li, J. Huang, Q. Huang, and Y. Wei, “Intent: Invariance and discrimination-aware noise mitigation for robust composed image retrieval,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 25, 2026, pp. 20463–20471.

  35. [35]

    Offset: Segmentation-based focus shift revision for composed image retrieval

    Z. Chen, Y. Hu, Z. Li, Z. Fu, X. Song, and L. Nie, “Offset: Segmentation-based focus shift revision for composed image retrieval,” in Proceedings of the ACM International Conference on Multimedia, 2025, pp. 6113–6122.

  36. [36]

    Refine: Composed video retrieval via shared and differential semantics enhancement

    Y. Hu, Z. Li, Z. Chen, Q. Huang, Z. Fu, M. Xu, and L. Nie, “Refine: Composed video retrieval via shared and differential semantics enhancement,” ACM Transactions on Multimedia Computing, Communications and Applications, 2026.

  37. [37]

    Active retrieval augmented generation

    Z. Jiang, F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, and G. Neubig, “Active retrieval augmented generation,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 7969–7992. [Online]. Avail...

  38. [38]

    Natural questions: A benchmark for question answering research

    T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov, “Natural questions: A benchmark for question answering research,” Transactions of the Association for Computational Linguistics, vol. 7, pp....

  39. [39]

    TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

    M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer, “TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), R. Barzilay and M.-Y. Kan, Eds. Vancouver, Canada: Association for Computational Linguistics, Jul. 20...

  40. [40]

    When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

    A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi, “When not to trust language models: Investigating effectiveness of parametric and non-parametric memories,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canad...

  41. [41]

    Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

    X. Ho, A.-K. Duong Nguyen, S. Sugawara, and A. Aizawa, “Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps,” in Proceedings of the 28th International Conference on Computational Linguistics, D. Scott, N. Bel, and C. Zong, Eds. Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 66...

  42. [42]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning, “HotpotQA: A dataset for diverse, explainable multi-hop question answering,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, Eds. Brussels, Belgium: Association for Computationa...

  43. [43]

    Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies

    M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, and J. Berant, “Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 346–361, 2021. [Online]. Available: https://aclanthology.org/2021.tacl-1.21/

  44. [44]

    Improving factuality and reasoning in language models through multiagent debate

    Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024. [Online]. Available: https://dl.acm.org/doi/10.5555/3692070.3692537

  45. [45]

    Sure: Summarizing retrievals using answer candidates for open-domain QA of LLMs

    J. Kim, J. Nam, S. Mo, J. Park, S.-W. Lee, M. Seo, J.-W. Ha, and J. Shin, “Sure: Summarizing retrievals using answer candidates for open-domain QA of LLMs,” arXiv preprint arXiv:2404.13081, 2024. [Online]. Available: https://arxiv.org/abs/2404.13081

  46. [46]

    The Llama 3 Herd of Models

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

  47. [47]

    Flashrag: A modular toolkit for efficient retrieval-augmented generation research

    J. Jin, Y. Zhu, X. Yang, C. Zhang, and Z. Dou, “Flashrag: A modular toolkit for efficient retrieval-augmented generation research,” CoRR, vol. abs/2405.13576, 2024. [Online]. Available: https://arxiv.org/abs/2405.13576

  48. [48]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, and F. Wei, “Text embeddings by weakly-supervised contrastive pre-training,” arXiv preprint arXiv:2212.03533, 2022. [Online]. Available: https://arxiv.org/abs/2212.03533