arxiv: 2409.10102 · v2 · pith:34JQ4IKWnew · submitted 2024-09-16 · 💻 cs.IR · cs.AI · cs.CL

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

Pith reviewed 2026-05-23 21:06 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords retrieval-augmented generation · trustworthiness · large language models · benchmark · factuality · RAG systems · trust framework

The pith

Trust-RAG Compass framework assesses RAG system trustworthiness on six dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-augmented generation improves large language models by grounding outputs in external knowledge, but trustworthiness issues persist from unreliable retrieval. This paper proposes the Trust-RAG Compass as a unified framework to evaluate RAG systems along factuality, robustness, fairness, transparency, accountability, and privacy. It reviews literature for each dimension and introduces the TRC Bench benchmark to test various models. Evaluations highlight performance differences between proprietary and open-source LLMs. The work outlines challenges and directions for building more trustworthy RAG systems.

Core claim

We propose a unified framework, Trust-RAG Compass, that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we provide a thorough review of the existing literature along each dimension. Furthermore, we introduce an evaluation benchmark, TRC Bench, regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Our results shed light on the performance gaps between different types of LLMs across varying dimensions of trustworthiness.

What carries the argument

Trust-RAG Compass framework, which structures trustworthiness assessment into the six dimensions and supports literature review plus benchmarking via TRC Bench.

If this is right

  • Literature on RAG trustworthiness can be organized along the six dimensions for structured analysis.
  • TRC Bench enables direct comparison of models on trustworthiness metrics.
  • Performance gaps appear between proprietary and open-source LLMs on the dimensions.
  • Key challenges identified can guide targeted improvements in RAG development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could apply the benchmark to diagnose and fix weaknesses in specific dimensions for their RAG deployments.
  • The framework structure might transfer to trustworthiness assessment in non-RAG LLM applications.
  • If new RAG risks appear, the dimensions could be revisited or expanded in follow-up work.
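
As a rough illustration of the first point, per-dimension scores could be compared to find a system's weakest dimension and the gap between two models. This is a minimal sketch under stated assumptions: the helper names (`check_scores`, `weakest_dimension`, `dimension_gaps`), the 0-to-1 scale where higher is better, and every number below are hypothetical placeholders, not TRC Bench results or the authors' scoring method.

```python
# Illustrative sketch only: helper names and all score values are hypothetical
# placeholders, not TRC Bench results or the paper's scoring procedure.
DIMENSIONS = (
    "factuality", "robustness", "fairness",
    "transparency", "accountability", "privacy",
)


def check_scores(scores: dict[str, float]) -> None:
    """Raise ValueError if any of the six dimensions is missing."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")


def weakest_dimension(scores: dict[str, float]) -> str:
    """Return the lowest-scoring dimension (assumes higher is better)."""
    check_scores(scores)
    return min(DIMENSIONS, key=lambda d: scores[d])


def dimension_gaps(a: dict[str, float], b: dict[str, float]) -> dict[str, float]:
    """Return score(a) - score(b) for each dimension."""
    check_scores(a)
    check_scores(b)
    return {d: a[d] - b[d] for d in DIMENSIONS}


# Placeholder scores for two hypothetical systems (made-up numbers).
model_a = {"factuality": 0.8, "robustness": 0.6, "fairness": 0.7,
           "transparency": 0.5, "accountability": 0.9, "privacy": 0.4}
model_b = {"factuality": 0.7, "robustness": 0.3, "fairness": 0.6,
           "transparency": 0.55, "accountability": 0.8, "privacy": 0.65}

print(weakest_dimension(model_a))
print(dimension_gaps(model_a, model_b))
```

Keeping the six dimensions separate rather than averaging them into one number matches the framework's intent: a single aggregate would hide exactly the per-dimension weakness a developer would want to fix.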

Load-bearing premise

The six dimensions comprehensively and without overlap capture all aspects of trustworthiness in RAG systems.

What would settle it

An empirical study that identifies a significant trustworthiness failure mode in RAG systems not covered by any of the six dimensions.

Figures

Figures reproduced from arXiv: 2409.10102 by Chaozhuo Li, Hongjin Qian, Jason Chen Zhang, Jiajie Jin, Jiaxin Mao, Jingying Shao, Philip S. Yu, Wenbo Zhang, Xiaoxi Li, Yan Liu, Yujia Zhou, Zheng Liu, Zhicheng Dou.

Figure 1: Six key dimensions of trustworthiness in Retrieval-Augmented Generation (RAG) systems.
Figure 2: The integration of six trustworthy RAG evaluation dimensions within the complete RAG framework.
Figure 3: Timeline of studies in trustworthy RAG across ...
Figure 4: The performance radar chart of various LLMs across ...

Original abstract

Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). Although existing research mainly emphasizes accuracy and efficiency, the trustworthiness of RAG systems remains insufficiently explored. RAG can improve LLM reliability by grounding responses in external and up-to-date knowledge, reducing hallucinations. However, unreliable retrieval or improper knowledge utilization may still lead to undesirable outputs. To address these concerns, we propose a unified framework, Trust-RAG Compass, that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we provide a thorough review of the existing literature along each dimension. Furthermore, we introduce an evaluation benchmark, TRC Bench (Trust-RAG Compass Benchmark), regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Our results shed light on the performance gaps between different types of LLMs across varying dimensions of trustworthiness. Finally, we identify key challenges and promising directions for future research based on our findings. Through this work, we aim to provide a structured foundation for subsequent investigations and practical guidance for developing trustworthy RAG systems in real-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes Trust-RAG Compass, a unified framework for assessing trustworthiness of Retrieval-Augmented Generation (RAG) systems across six dimensions (factuality, robustness, fairness, transparency, accountability, privacy). It reviews the literature organized by these dimensions, introduces the TRC Bench evaluation benchmark covering the six dimensions, evaluates a range of proprietary and open-source LLMs on the benchmark, reports performance gaps, and outlines key challenges and future directions.

Significance. A well-justified taxonomy and benchmark could supply a needed organizing structure for trustworthiness research in RAG, moving beyond accuracy-focused evaluations and enabling systematic comparisons across model types.

major comments (1)
  1. [Framework introduction / §3] The manuscript presents the six dimensions of Trust-RAG Compass (factuality, robustness, fairness, transparency, accountability, privacy) as given in the abstract and framework definition without deriving them from a systematic enumeration of RAG failure modes or comparing the partition against plausible alternatives (e.g., addition of security or calibration). Because this choice structures the entire literature review and the construction of TRC Bench, the absence of explicit justification or validation is load-bearing for the central claim.
minor comments (2)
  1. [Framework section] Ensure that the definition of each dimension in the framework section is accompanied by a short list of concrete RAG-specific failure examples so readers can map the taxonomy to observed behaviors.
  2. [TRC Bench section] In the benchmark description, clarify how the six evaluation subsets were constructed and whether any overlap or redundancy between dimensions was measured.
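
One way the overlap check requested in minor comment 2 might be run is a pairwise Jaccard overlap between the item sets of the evaluation subsets. This is only a sketch: it assumes each benchmark item has a string ID and each dimension's subset can be represented as a set of such IDs, which the paper's benchmark description may or may not support. The function names and the toy IDs are hypothetical.

```python
from itertools import combinations

# Illustrative sketch only: the item IDs and subset contents are hypothetical,
# not taken from TRC Bench.


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(subsets: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Jaccard overlap for every unordered pair of dimension subsets."""
    return {
        (x, y): jaccard(subsets[x], subsets[y])
        for x, y in combinations(sorted(subsets), 2)
    }


# Toy example with made-up item IDs.
toy_subsets = {
    "factuality": {"q1", "q2", "q3"},
    "robustness": {"q3", "q4"},
    "privacy": {"q5"},
}

for pair, score in pairwise_overlap(toy_subsets).items():
    print(pair, round(score, 3))
```

Shared items are only one notion of redundancy. Two subsets with no items in common could still measure nearly the same behavior, so a correlation of per-model scores across dimensions would be a natural complement.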

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the justification of the Trust-RAG Compass framework. We address the single major comment below and will incorporate revisions accordingly.

Point-by-point responses
  1. Referee: [Framework introduction / §3] The manuscript presents the six dimensions of Trust-RAG Compass (factuality, robustness, fairness, transparency, accountability, privacy) as given in the abstract and framework definition without deriving them from a systematic enumeration of RAG failure modes or comparing the partition against plausible alternatives (e.g., addition of security or calibration). Because this choice structures the entire literature review and the construction of TRC Bench, the absence of explicit justification or validation is load-bearing for the central claim.

    Authors: We agree that an explicit derivation from RAG failure modes would strengthen the framework's foundation. In the revised manuscript we will expand §3 with a new subsection that first enumerates representative RAG failure modes drawn from the surveyed literature (hallucinations and retrieval errors for factuality; adversarial retrieval attacks and distribution shifts for robustness; biased retrieval results for fairness; opaque retrieval-generation pipelines for transparency; lack of audit trails for accountability; and leakage of private retrieved content for privacy). We will then map each dimension to these modes and briefly compare the resulting partition against alternatives, noting that security concerns are largely subsumed under robustness and privacy while calibration issues fall under factuality. This addition will directly support the literature organization and TRC Bench construction without changing the six dimensions themselves. revision: yes

Circularity Check

0 steps flagged

No circularity: framework organizes literature review without self-referential reduction

Full rationale

The paper proposes Trust-RAG Compass as an organizing framework for a survey of existing RAG trustworthiness literature, listing the six dimensions directly in the abstract and stating that the review proceeds 'within this framework.' No equations, fitted parameters, or self-citations are invoked to derive the dimensions; the structure is presented as a proposed taxonomy drawn from the reviewed works rather than reducing to any input by construction. The central claims (literature organization and TRC Bench) therefore remain independent of the framework definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that proposes a conceptual framework and benchmark rather than a mathematical derivation; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5805 in / 1213 out tokens · 28371 ms · 2026-05-23T21:06:17.119181+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Why Retrieval-Augmented Generation Fails: A Graph Perspective

    cs.CL 2026-05 unverdicted novelty 6.0

    Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.

  2. When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

    cs.AI 2026-02 unverdicted novelty 6.0

    Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.

  3. Search-o1: Agentic Search-Enhanced Large Reasoning Models

    cs.AI 2025-01 unverdicted novelty 6.0

    Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...

  4. ALDEN: Boosting Private Data Extraction from Retrieval-Augmented Generation Systems via Active Learning and Distribution Estimation

    cs.IR 2026-04 unverdicted novelty 5.0

    ALDEN boosts private data extraction rates from RAG systems by combining active learning for query diversification with dynamic estimation of the underlying knowledge-base topic distribution.

Reference graph

Works this paper leans on

154 extracted references · 154 canonical work pages · cited by 4 Pith papers · 9 internal anchors

  1. [1]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023

  2. [2]

    Exploring the limits of transfer learning with a unified text-to- text transformer,

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P . J. Liu, “Exploring the limits of transfer learning with a unified text-to- text transformer,” J. Mach. Learn. Res. , vol. 21, pp. 140:1–140:67, 2020

  3. [3]

    Leveraging passage re- trieval with generative models for open domain ques- tion answering,

    G. Izacard and E. Grave, “Leveraging passage re- trieval with generative models for open domain ques- tion answering,” in EACL. Association for Computa- tional Linguistics, 2021, pp. 874–880

  4. [4]

    WebGPT: Browser-assisted question-answering with human feedback

    R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V . Kosaraju, W. Saun- ders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, and J. Schulman, “Webgpt: Browser-assisted question-answering with human feedback,” CoRR, vol. abs/2112.09332, 2021. 17

  5. [5]

    A multitask, multilingual, multi- modal evaluation of chatgpt on reasoning, hallucina- tion, and interactivity,

    Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V . Do, Y. Xu, and P . Fung, “A multitask, multilingual, multi- modal evaluation of chatgpt on reasoning, hallucina- tion, and interactivity,” inIJCNLP (1). Association for Computational Linguistics, 2023, pp. 675–718

  6. [6]

    Unsupervised real-time hallucination detec- tion based on the internal states of large language models,

    W. Su, C. Wang, Q. Ai, Y. Hu, Z. Wu, Y. Zhou, and Y. Liu, “Unsupervised real-time hallucination detec- tion based on the internal states of large language models,” in ACL (Findings) . Association for Com- putational Linguistics, 2024, pp. 14 379–14 391

  7. [7]

    Mitigating social biases of pre-trained lan- guage models via contrastive self-debiasing with dou- ble data augmentation,

    Y. Li, M. Du, R. Song, X. Wang, M. Sun, and Y. Wang, “Mitigating social biases of pre-trained lan- guage models via contrastive self-debiasing with dou- ble data augmentation,” Artificial Intelligence, vol. 332, p. 104143, 2024

  8. [8]

    Merging generated and retrieved knowledge for open-domain QA,

    Y. Zhang, M. Khalifa, L. Logeswaran, M. Lee, H. Lee, and L. Wang, “Merging generated and retrieved knowledge for open-domain QA,” in EMNLP. Asso- ciation for Computational Linguistics, 2023, pp. 4710– 4728

  9. [9]

    S. Pal, M. Bhattacharya, M. A. Islam, and C. Chakraborty, “Chatgpt or llm in next-generation drug discovery and development: pharmaceutical and biotechnology companies can make use of the artifi- cial intelligence-based device for a faster way of drug discovery and development,” International Journal of Surgery, vol. 109, no. 12, pp. 4382–4384, 2023

  10. [10]

    REPLUG: Retrieval-Augmented Black-Box Language Models

    W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W. Yih, “REPLUG: retrieval-augmented black-box language models,” CoRR, vol. abs/2301.12652, 2023

  11. [11]

    Benchmarking large language models in retrieval-augmented gener- ation,

    J. Chen, H. Lin, X. Han, and L. Sun, “Benchmarking large language models in retrieval-augmented gener- ation,” in AAAI. AAAI Press, 2024, pp. 17 754–17 762

  12. [12]

    Revolutionizing finance with llms: An overview of applications and insights,

    H. Zhao, Z. Liu, Z. Wu, Y. Li, T. Yang, P . Shu, S. Xu, H. Dai, L. Zhao, G. Maiet al., “Revolutionizing finance with llms: An overview of applications and insights,” arXiv preprint arXiv:2401.11641, 2024

  13. [13]

    Clipsyntel: clip and llm synergy for multimodal question summarization in healthcare,

    A. Ghosh, A. Acharya, R. Jain, S. Saha, A. Chadha, and S. Sinha, “Clipsyntel: clip and llm synergy for multimodal question summarization in healthcare,” in Proceedings of the AAAI Conference on Artificial In- telligence, vol. 38, no. 20, 2024, pp. 22 031–22 039

  14. [15]

    Selecmix: Debiased learning by contradicting-pair sampling,

    I. Hwang, S. Lee, Y. Kwak, S. J. Oh, D. Teney, J.-H. Kim, and B.-T. Zhang, “Selecmix: Debiased learning by contradicting-pair sampling,” Advances in Neural Information Processing Systems , vol. 35, pp. 14 345– 14 357, 2022

  15. [16]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P . Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algo- rithms,” CoRR, vol. abs/1707.06347, 2017

  16. [17]

    Ew-tune: A framework for privately fine-tuning large language models with differential privacy,

    R. Behnia, M. Ebrahimi, J. Pacheco, and B. Padmanab- han, “Ew-tune: A framework for privately fine-tuning large language models with differential privacy,” in ICDM (Workshops). IEEE, 2022, pp. 560–566

  17. [18]

    On robustness of prompt- based semantic parsing with large pre-trained lan- guage model: An empirical study on codex,

    T. Y. Zhuo, Z. Li, Y. Huang, F. Shiri, W. Wang, G. Haffari, and Y. Li, “On robustness of prompt- based semantic parsing with large pre-trained lan- guage model: An empirical study on codex,” in EACL. Association for Computational Linguistics, 2023, pp. 1090–1102

  18. [19]

    Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

    Y. Liu, Y. Yao, J. Ton, X. Zhang, R. Guo, H. Cheng, Y. Klochkov, M. F. Taufiq, and H. Li, “Trust- worthy llms: a survey and guideline for evaluat- ing large language models’ alignment,” CoRR, vol. abs/2308.05374, 2023

  19. [20]

    TrustLLM: Trustworthiness in Large Language Models

    L. Sun, Y. Huang, H. Wang, S. Wu, Q. Zhang, C. Gao, Y. Huang, W. Lyu, Y. Zhang, X. Li, Z. Liu, Y. Liu, Y. Wang, Z. Zhang, B. Kailkhura, C. Xiong, C. Xiao, C. Li, E. P . Xing, F. Huang, H. Liu, H. Ji, H. Wang, H. Zhang, H. Yao, M. Kellis, M. Zitnik, M. Jiang, M. Bansal, J. Zou, J. Pei, J. Liu, J. Gao, J. Han, J. Zhao, J. Tang, J. Wang, J. Mitchell, K. Shu,...

  20. [21]

    Atlas: Few-shot learning with retrieval augmented language models,

    G. Izacard, P . S. H. Lewis, M. Lomeli, L. Hos- seini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, “Atlas: Few-shot learning with retrieval augmented language models,” J. Mach. Learn. Res., vol. 24, pp. 251:1–251:43, 2023

  21. [22]

    Ragbench: Explain- able benchmark for retrieval-augmented generation systems,

    R. Friel, M. Belyi, and A. Sanyal, “Ragbench: Explain- able benchmark for retrieval-augmented generation systems,” 2024

  22. [23]

    Rag- ex: A generic framework for explaining retrieval aug- mented generation,

    V . Sudhi, S. R. Bhat, M. Rudat, and R. Teucher, “Rag- ex: A generic framework for explaining retrieval aug- mented generation,” in SIGIR. ACM, 2024, pp. 2776– 2780

  23. [24]

    Fairrag: Fair human generation via fair retrieval aug- mentation,

    R. Shrestha, Y. Zou, Q. Chen, Z. Li, Y. Xie, and S. Deng, “Fairrag: Fair human generation via fair retrieval aug- mentation,” CoRR, vol. abs/2403.19964, 2024

  24. [25]

    Active retrieval augmented generation,

    Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi- Yu, Y. Yang, J. Callan, and G. Neubig, “Active retrieval augmented generation,” in EMNLP. Association for Computational Linguistics, 2023, pp. 7969–7992

  25. [26]

    Retrieval- augmented generation for knowledge-intensive NLP tasks,

    P . S. H. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K¨uttler, M. Lewis, W. Yih, T. Rockt ¨aschel, S. Riedel, and D. Kiela, “Retrieval- augmented generation for knowledge-intensive NLP tasks,” in NeurIPS, 2020

  26. [27]

    Retrieval augmented language model pre-training,

    K. Guu, K. Lee, Z. Tung, P . Pasupat, and M. Chang, “Retrieval augmented language model pre-training,” in ICML, ser. Proceedings of Machine Learning Re- search, vol. 119. PMLR, 2020, pp. 3929–3938

  27. [28]

    Improving language models by retrieving from trillions of to- 18 kens,

    S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J. Lespiau, B. Damoc, A. Clark, D. de Las Casas, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Pa- ganini, G. Irving, O. Vinyals, S. Osindero, K. Si- monyan, J. W. Rae, E. Elsen, and L. Sifre, “Improving lang...

  28. [29]

    Generalization through memoriza- tion: Nearest neighbor language models,

    U. Khandelwal, O. Levy, D. Jurafsky, L. Zettlemoyer, and M. Lewis, “Generalization through memoriza- tion: Nearest neighbor language models,” in 8th In- ternational Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 . Open- Review.net, 2020

  29. [30]

    Chain-of- thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of- thought prompting elicits reasoning in large language models,” in NeurIPS, 2022

  30. [31]

    Tree of thoughts: Deliberate problem solving with large language models,

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” in NeurIPS, 2023

  31. [32]

    Self- consistency improves chain of thought reasoning in language models,

    X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self- consistency improves chain of thought reasoning in language models,” in ICLR. OpenReview.net, 2023

  32. [33]

    Large language models can be easily distracted by irrelevant context,

    F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Sch¨arli, and D. Zhou, “Large language models can be easily distracted by irrelevant context,” in ICML, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 2023, pp. 31 210–31 227

  33. [34]

    Take a step back: Evoking reasoning via abstraction in large language models,

    H. S. Zheng, S. Mishra, X. Chen, H. Cheng, E. H. Chi, Q. V . Le, and D. Zhou, “Take a step back: Evoking reasoning via abstraction in large language models,” CoRR, vol. abs/2310.06117, 2023

  34. [35]

    Promptagator: Few-shot dense retrieval from 8 ex- amples,

    Z. Dai, V . Y. Zhao, J. Ma, Y. Luan, J. Ni, J. Lu, A. Bakalov, K. Guu, K. B. Hall, and M. Chang, “Promptagator: Few-shot dense retrieval from 8 ex- amples,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

  35. [36]

    Query rewrit- ing for retrieval-augmented large language models,

    X. Ma, Y. Gong, P . He, H. Zhao, and N. Duan, “Query rewriting for retrieval-augmented large lan- guage models,” CoRR, vol. abs/2305.14283, 2023

  36. [37]

    How context af- fects language models’ factual predictions,

    F. Petroni, P . S. H. Lewis, A. Piktus, T. Rockt ¨aschel, Y. Wu, A. H. Miller, and S. Riedel, “How context af- fects language models’ factual predictions,” in AKBC, 2020

  37. [38]

    Re2g: Retrieve, rerank, generate,

    M. R. Glass, G. Rossiello, M. F. M. Chowdhury, A. Naik, P . Cai, and A. Gliozzo, “Re2g: Retrieve, rerank, generate,” in NAACL-HLT. Association for Computational Linguistics, 2022, pp. 2701–2715

  38. [39]

    Walking down the memory maze: Beyond context limit through interactive reading,

    H. Chen, R. Pasunuru, J. Weston, and A. Celiky- ilmaz, “Walking down the memory maze: Beyond context limit through interactive reading,” CoRR, vol. abs/2310.05029, 2023

  39. [40]

    Sure: Summarizing retrievals using answer candidates for open-domain QA of llms,

    J. Kim, J. Nam, S. Mo, J. Park, S. Lee, M. Seo, J. Ha, and J. Shin, “Sure: Summarizing retrievals using answer candidates for open-domain QA of llms,” CoRR, vol. abs/2404.13081, 2024

  40. [41]

    RECOMP: improving retrieval-augmented lms with context compression and selective augmentation,

    F. Xu, W. Shi, and E. Choi, “RECOMP: improving retrieval-augmented lms with context compression and selective augmentation,” in ICLR. OpenRe- view.net, 2024

  41. [42]

    BIDER: bridg- ing knowledge inconsistency for efficient retrieval- augmented llms via key supporting evidence,

    J. Jin, Y. Zhu, Y. Zhou, and Z. Dou, “BIDER: bridg- ing knowledge inconsistency for efficient retrieval- augmented llms via key supporting evidence,” CoRR, vol. abs/2402.12174, 2024

  42. [43]

    PRCA: fitting black-box large language models for retrieval question answering via pluggable reward-driven contextual adapter,

    H. Yang, Z. Li, Y. Zhang, J. Wang, N. Cheng, M. Li, and J. Xiao, “PRCA: fitting black-box large language models for retrieval question answering via pluggable reward-driven contextual adapter,” inEMNLP. Asso- ciation for Computational Linguistics, 2023, pp. 5364– 5375

  43. [44]

    Self-knowledge guided retrieval augmentation for large language models,

    Y. Wang, P . Li, M. Sun, and Y. Liu, “Self-knowledge guided retrieval augmentation for large language models,” in Findings of the Association for Computa- tional Linguistics: EMNLP 2023, Singapore, December 6-10, 2023 , H. Bouamor, J. Pino, and K. Bali, Eds. Association for Computational Linguistics, 2023, pp. 10 303–10 315

  44. [45]

    Adaptive-rag: Learning to adapt retrieval- augmented large language models through question complexity,

    S. Jeong, J. Baek, S. Cho, S. J. Hwang, and J. Park, “Adaptive-rag: Learning to adapt retrieval- augmented large language models through question complexity,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, ...

  45. [46]

    React: Synergizing reason- ing and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reason- ing and acting in language models,” in ICLR. Open- Review.net, 2023

  46. [47]

    Measuring and narrowing the com- positionality gap in language models,

    O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis, “Measuring and narrowing the com- positionality gap in language models,” in EMNLP (Findings). Association for Computational Linguis- tics, 2023, pp. 5687–5711

  47. [48]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step ques- tions,

    H. Trivedi, N. Balasubramanian, T. Khot, and A. Sab- harwal, “Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step ques- tions,” in ACL (1) . Association for Computational Linguistics, 2023, pp. 10 014–10 037

  48. [49]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection,

    A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” 2024

  49. [50]

    Toolformer: Language models can teach themselves to use tools,

    T. Schick, J. Dwivedi-Yu, R. Dess `ı, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” in NeurIPS, 2023

  50. [51]

    The role of chatgpt in scientific communication: writing better scientific review arti- cles,

    J. Huang and M. Tan, “The role of chatgpt in scientific communication: writing better scientific review arti- cles,” American journal of cancer research , vol. 13, no. 4, p. 1148, 2023

  51. [52]

    When llm-based code genera- tion meets the software development process,

    F. Lin, D. J. Kim et al., “When llm-based code genera- tion meets the software development process,” arXiv preprint arXiv:2403.15852, 2024

  52. [53]

    From pre-training corpora to large lan- guage models: What factors influence llm perfor- mance in causal discovery tasks?

    T. Feng, L. Qu, N. Tandon, Z. Li, X. Kang, and G. Haffari, “From pre-training corpora to large lan- guage models: What factors influence llm perfor- mance in causal discovery tasks?” arXiv preprint arXiv:2407.19638, 2024

  53. [54]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin et al. , “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” arXiv 19 preprint arXiv:2311.05232, 2023

  54. [55]

    Llm-driven robots risk enacting discrimination, violence, and unlawful actions,

    R. Azeem, A. Hundt, M. Mansouri, and M. Brand ˜ao, “Llm-driven robots risk enacting discrimination, violence, and unlawful actions,” arXiv preprint arXiv:2406.08824, 2024

  55. [56]

    On protecting the data privacy of large language models (llms): A survey,

    B. Yan, K. Li, M. Xu, Y. Dong, Y. Zhang, Z. Ren, and X. Cheng, “On protecting the data privacy of large language models (llms): A survey,” arXiv preprint arXiv:2403.05156, 2024

  56. [57]

    A new era in llm security: Exploring security con- cerns in real-world llm-based systems,

    F. Wu, N. Zhang, S. Jha, P . McDaniel, and C. Xiao, “A new era in llm security: Exploring security con- cerns in real-world llm-based systems,” arXiv preprint arXiv:2402.18649, 2024

  57. [58]

    SAIL: search-augmented instruction learning,

    H. Luo, Y. Chuang, Y. Gong, T. Zhang, Y. Kim, X. Wu, D. Fox, H. Meng, and J. R. Glass, “SAIL: search-augmented instruction learning,” CoRR, vol. abs/2305.15225, 2023

  58. [59]

    Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback

    B. Peng, M. Galley, P . He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen, and J. Gao, “Check your facts and try again: Improving large language models with external knowledge and auto- mated feedback,” CoRR, vol. abs/2302.12813, 2023

  59. [60]

    Generate rather than retrieve: Large language models are strong context generators,

    W. Yu, D. Iter, S. Wang, Y. Xu, M. Ju, S. Sanyal, C. Zhu, M. Zeng, and M. Jiang, “Generate rather than retrieve: Large language models are strong context generators,” in ICLR. OpenReview.net, 2023

  60. [61]

    Recall: A benchmark for llms robustness against external counterfactual knowledge,

    Y. Liu, L. Huang, S. Li, S. Chen, H. Zhou, F. Meng, J. Zhou, and X. Sun, “RECALL: A benchmark for llms robustness against external counterfactual knowl- edge,” CoRR, vol. abs/2311.08147, 2023

  61. [62]

    On the risk of misinformation pollution with large language models,

    Y. Pan, L. Pan, W. Chen, P . Nakov, M. Kan, and W. Y. Wang, “On the risk of misinformation pollution with large language models,” in EMNLP (Findings) . Association for Computational Linguistics, 2023, pp. 1389–1403

  62. [63]

    Typos that broke the rag’s back: Genetic attack on rag pipeline by simulating documents in the wild via low-level perturbations

    S. Cho, S. Jeong, J. Seo, T. Hwang, and J. C. Park, “Ty- pos that broke the rag’s back: Genetic attack on RAG pipeline by simulating documents in the wild via low- level perturbations,” CoRR, vol. abs/2404.13948, 2024

  63. [64]

    Poi- soning retrieval corpora by injecting adversarial pas- sages,

    Z. Zhong, Z. Huang, A. Wettig, and D. Chen, “Poi- soning retrieval corpora by injecting adversarial pas- sages,” in EMNLP. Association for Computational Linguistics, 2023, pp. 13 764–13 775

  64. [65]

    Attacking open-domain question answering by injecting misinformation

    L. Pan, W. Chen, M. Kan, and W. Y. Wang, “Attacking open-domain question answering by injecting misinformation,” in IJCNLP (1). Association for Computational Linguistics, 2023, pp. 525–539

  65. [66]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

    S. Abdelnabi, K. Greshake, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection,” in AISec@CCS. ACM, 2023, pp. 79–90

  66. [67]

    Defending against disinformation attacks in open-domain question answering

    O. Weller, A. Khan, N. Weir, D. J. Lawrie, and B. V. Durme, “Defending against disinformation attacks in open-domain question answering,” in EACL (2). Association for Computational Linguistics, 2024, pp. 402–417

  67. [68]

    Why so gullible? enhancing the robustness of retrieval-augmented models against counterfactual noise

    G. Hong, J. Kim, J. Kang, S. Myaeng, and J. J. Whang, “Why so gullible? enhancing the robustness of retrieval-augmented models against counterfactual noise,” CoRR, vol. abs/2305.01579, 2023

  68. [69]

    Certifiably robust rag against retrieval corruption

    C. Xiang, T. Wu, Z. Zhong, D. Wagner, D. Chen, and P. Mittal, “Certifiably robust rag against retrieval corruption,” arXiv preprint arXiv:2405.15556, 2024

  69. [70]

    Webbrain: Learning to generate factually correct articles for queries by grounding on large web corpus

    H. Qian, Y. Zhu, Z. Dou, H. Gu, X. Zhang, Z. Liu, R. Lai, Z. Cao, J.-Y. Nie, and J.-R. Wen, “Webbrain: Learning to generate factually correct articles for queries by grounding on large web corpus,” CoRR, vol. abs/2304.04358, 2023

  70. [71]

    Search-in-the-chain: Towards accurate, credible and traceable large language models for knowledge-intensive tasks

    S. Xu, L. Pang, H. Shen, X. Cheng, and T.-S. Chua, “Search-in-the-chain: Towards the accurate, credible and traceable content generation for complex knowledge-intensive tasks,” CoRR, vol. abs/2304.14732, 2023

  71. [72]

    Llatrieval: Llm-verified retrieval for verifiable generation

    X. Li, C. Zhu, L. Li, Z. Yin, T. Sun, and X. Qiu, “Llatrieval: Llm-verified retrieval for verifiable generation,” CoRR, vol. abs/2311.07838, 2023

  72. [73]

    Effective large language model adaptation for improved grounding

    X. Ye, R. Sun, S. Ö. Arik, and T. Pfister, “Effective large language model adaptation for improved grounding,” CoRR, vol. abs/2311.09533, 2023

  73. [74]

    HGOT: hierarchical graph of thoughts for retrieval-augmented in-context learning in factuality evaluation

    Y. Fang, S. W. Thomas, and X. Zhu, “HGOT: hierarchical graph of thoughts for retrieval-augmented in-context learning in factuality evaluation,” CoRR, vol. abs/2402.09390, 2024

  74. [75]

    Ground every sentence: Improving retrieval-augmented llms with interleaved reference-claim generation

    S. Xia, X. Wang, J. Liang, Y. Zhang, W. Zhou, J. Deng, F. Yu, and Y. Xiao, “Ground every sentence: Improving retrieval-augmented llms with interleaved reference-claim generation,” arXiv preprint arXiv:2407.01796, 2024

  75. [76]

    PURR: efficiently editing language model hallucinations by denoising language model corruptions

    A. Chen, P. Pasupat, S. Singh, H. Lee, and K. Guu, “PURR: efficiently editing language model hallucinations by denoising language model corruptions,” CoRR, vol. abs/2305.14908, 2023

  76. [77]

    Citation-enhanced generation for llm-based chatbots

    W. Li, J. Li, W. Ma, and Y. Liu, “Citation-enhanced generation for llm-based chatbots,” CoRR, vol. abs/2402.16063, 2024

  77. [78]

    Retrieving supporting evidence for generative question answering

    S. Huo, N. Arabzadeh, and C. Clarke, “Retrieving supporting evidence for generative question answering,” in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 2023, pp. 11–20

  78. [79]

    Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models

    W. Zou, R. Geng, B. Wang, and J. Jia, “Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models,” CoRR, vol. abs/2402.07867, 2024

  79. [80]

    Phantom: General trigger attacks on retrieval augmented language generation

    H. Chaudhari, G. Severi, J. Abascal, M. Jagielski, C. A. Choquette-Choo, M. Nasr, C. Nita-Rotaru, and A. Oprea, “Phantom: General trigger attacks on retrieval augmented language generation,” arXiv preprint arXiv:2405.20485, 2024

  80. [81]

    Neural exec: Learning (and learning from) execution triggers for prompt injection attacks

    D. Pasquini, M. Strohmeier, and C. Troncoso, “Neural exec: Learning (and learning from) execution triggers for prompt injection attacks,” CoRR, vol. abs/2403.03792, 2024

Showing first 80 references.