Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

Chaozhuo Li; Hongjin Qian; Jason Chen Zhang; Jiajie Jin; Jiaxin Mao; Jingying Shao; Philip S. Yu; Wenbo Zhang; Xiaoxi Li; Yan Liu

arXiv: 2409.10102 · v2 · pith:34JQ4IKWnew · submitted 2024-09-16 · cs.IR · cs.AI · cs.CL

Yujia Zhou, Wenbo Zhang, Jingying Shao, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Jason Chen Zhang, Zhicheng Dou, Philip S. Yu, Jiaxin Mao

Pith reviewed 2026-05-23 21:06 UTC · model grok-4.3

classification cs.IR · cs.AI · cs.CL

keywords retrieval-augmented generation · trustworthiness · large language models · benchmark · factuality · RAG systems · trust framework

The pith

The Trust-RAG Compass framework assesses the trustworthiness of RAG systems along six dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-augmented generation improves large language models by grounding outputs in external knowledge, but trustworthiness problems persist when retrieval is unreliable or retrieved knowledge is used improperly. This paper proposes the Trust-RAG Compass as a unified framework to evaluate RAG systems along six dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. It reviews the literature for each dimension and introduces the TRC Bench benchmark to test a variety of models. The evaluations highlight performance differences between proprietary and open-source LLMs. The work also outlines challenges and directions for building more trustworthy RAG systems.

Core claim

We propose a unified framework, Trust-RAG Compass, that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we provide a thorough review of the existing literature along each dimension. Furthermore, we introduce an evaluation benchmark, TRC Bench, regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Our results shed light on the performance gaps between different types of LLMs across varying dimensions of trustworthiness.

What carries the argument

The Trust-RAG Compass framework, which structures trustworthiness assessment into the six dimensions and supports both the literature review and the benchmarking done with TRC Bench.

If this is right

Literature on RAG trustworthiness can be organized along the six dimensions for structured analysis.
TRC Bench enables direct comparison of models on trustworthiness metrics.
Performance gaps appear between proprietary and open-source LLMs on the dimensions.
Key challenges identified can guide targeted improvements in RAG development.
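The model comparison described in the list above can be sketched in code. This is a minimal illustration, assuming each model receives one score in [0, 1] per dimension. The names `ModelScores`, `group_means`, and `dimension_gaps`, the score scale, and the group labels are hypothetical and are not taken from TRC Bench. The numbers in the usage example are placeholders, not results from the paper.

```python
from dataclasses import dataclass
from statistics import mean

# The six dimensions named by the Trust-RAG Compass framework.
DIMENSIONS = (
    "factuality", "robustness", "fairness",
    "transparency", "accountability", "privacy",
)


@dataclass(frozen=True)
class ModelScores:
    """Hypothetical per-model record: one score in [0, 1] per dimension."""
    name: str
    group: str    # assumed labels: "proprietary" or "open-source"
    scores: dict  # dimension name -> score


def group_means(results, group):
    """Average each dimension over all models in the given group."""
    members = [r for r in results if r.group == group]
    if not members:
        raise ValueError(f"no models in group {group!r}")
    return {d: mean(r.scores[d] for r in members) for d in DIMENSIONS}


def dimension_gaps(results):
    """Proprietary mean minus open-source mean, per dimension."""
    proprietary = group_means(results, "proprietary")
    open_source = group_means(results, "open-source")
    return {d: proprietary[d] - open_source[d] for d in DIMENSIONS}


if __name__ == "__main__":
    # Placeholder values for illustration only; not results from the paper.
    results = [
        ModelScores("model_a", "proprietary", dict.fromkeys(DIMENSIONS, 0.75)),
        ModelScores("model_b", "open-source", dict.fromkeys(DIMENSIONS, 0.5)),
    ]
    for dim, gap in dimension_gaps(results).items():
        print(f"{dim}: {gap:+.2f}")
```

Reporting the gap per dimension, rather than as one aggregate number, keeps visible the possibility that a model group leads on one dimension and trails on another.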

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could apply the benchmark to diagnose and fix weaknesses in specific dimensions for their RAG deployments.
The framework structure might transfer to trustworthiness assessment in non-RAG LLM applications.
If new RAG risks appear, the dimensions could be revisited or expanded in follow-up work.

Load-bearing premise

The six dimensions comprehensively and without overlap capture all aspects of trustworthiness in RAG systems.

What would settle it

An empirical study that identifies a significant trustworthiness failure mode in RAG systems not covered by any of the six dimensions.

Figures

Figures reproduced from arXiv: 2409.10102 by Chaozhuo Li, Hongjin Qian, Jason Chen Zhang, Jiajie Jin, Jiaxin Mao, Jingying Shao, Philip S. Yu, Wenbo Zhang, Xiaoxi Li, Yan Liu, Yujia Zhou, Zheng Liu, Zhicheng Dou.

**Figure 2.** The integration of six trustworthy RAG evaluation dimensions within the complete RAG framework.

**Figure 3.** Timeline of studies in trustworthy RAG across …

**Figure 4.** The performance radar chart of various LLMs across …

Original abstract

Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). Although existing research mainly emphasizes accuracy and efficiency, the trustworthiness of RAG systems remains insufficiently explored. RAG can improve LLM reliability by grounding responses in external and up-to-date knowledge, reducing hallucinations. However, unreliable retrieval or improper knowledge utilization may still lead to undesirable outputs. To address these concerns, we propose a unified framework, Trust-RAG Compass, that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we provide a thorough review of the existing literature along each dimension. Furthermore, we introduce an evaluation benchmark, TRC Bench (Trust-RAG Compass Benchmark), regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Our results shed light on the performance gaps between different types of LLMs across varying dimensions of trustworthiness. Finally, we identify key challenges and promising directions for future research based on our findings. Through this work, we aim to provide a structured foundation for subsequent investigations and practical guidance for developing trustworthy RAG systems in real-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This survey introduces the Trust-RAG Compass framework and TRC Bench, but the six dimensions are presented without a clear derivation or comparison to alternatives.

The main thing you should know is that this is a survey paper that introduces the Trust-RAG Compass framework with six dimensions and a benchmark called TRC Bench, along with some model evaluations. It aims to structure the emerging work on trustworthy RAG.

The paper does a solid job reviewing the literature across factuality, robustness, fairness, transparency, accountability, and privacy. The evaluations on various models point out performance differences, which gives concrete examples of where current systems fall short. Providing a benchmark is a positive step toward making trustworthiness measurable in practice.

Where it is softer is in the justification for the specific six dimensions. The abstract presents them directly without showing a systematic process for selecting them or checking them against other possible dimensions such as security or uncertainty calibration. If the paper lacks a section deriving these from RAG failure modes or comparing partitions, the framework's claim to be unified rests on an untested choice. The details of how TRC Bench was built, and whether the results truly demonstrate the gaps, would also need a close look for reproducibility.

This paper is for researchers and practitioners focused on reliable retrieval-augmented systems who want an organized view of the literature and a starting benchmark. It shows honest engagement with the topic by trying to create structure where there was none. It deserves a serious referee to help sharpen the framework and validate the benchmark. I would recommend sending it out for peer review rather than desk rejecting it.

Referee Report

1 major / 2 minor

Summary. The paper proposes Trust-RAG Compass, a unified framework for assessing trustworthiness of Retrieval-Augmented Generation (RAG) systems across six dimensions (factuality, robustness, fairness, transparency, accountability, privacy). It reviews the literature organized by these dimensions, introduces the TRC Bench evaluation benchmark covering the six dimensions, evaluates a range of proprietary and open-source LLMs on the benchmark, reports performance gaps, and outlines key challenges and future directions.

Significance. A well-justified taxonomy and benchmark could supply a needed organizing structure for trustworthiness research in RAG, moving beyond accuracy-focused evaluations and enabling systematic comparisons across model types.

major comments (1)

[Framework introduction / §3] The manuscript presents the six dimensions of Trust-RAG Compass (factuality, robustness, fairness, transparency, accountability, privacy) as given in the abstract and framework definition without deriving them from a systematic enumeration of RAG failure modes or comparing the partition against plausible alternatives (e.g., addition of security or calibration). Because this choice structures the entire literature review and the construction of TRC Bench, the absence of explicit justification or validation is load-bearing for the central claim.

minor comments (2)

[Framework section] Ensure that the definition of each dimension in the framework section is accompanied by a short list of concrete RAG-specific failure examples so readers can map the taxonomy to observed behaviors.
[TRC Bench section] In the benchmark description, clarify how the six evaluation subsets were constructed and whether any overlap or redundancy between dimensions was measured.
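One way the overlap question in the second minor comment could be approached is sketched below. This is an illustration only, assuming each evaluation subset can be represented as a set of item identifiers. The function names and the choice of Jaccard similarity are assumptions made here, not a description of how TRC Bench was built or audited.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of item identifiers."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty subsets share nothing, so report zero overlap.
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(subsets):
    """Map each unordered pair of dimension names to its Jaccard overlap.

    `subsets` maps a dimension name to the identifiers of the items
    used to evaluate that dimension (a hypothetical representation).
    """
    return {
        (x, y): jaccard(subsets[x], subsets[y])
        for x, y in combinations(sorted(subsets), 2)
    }
```

A pair with a high overlap value would indicate that two dimensions are being scored on largely the same items, which is the kind of redundancy the comment asks the authors to report.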

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the justification of the Trust-RAG Compass framework. We address the single major comment below and will incorporate revisions accordingly.

Referee: [Framework introduction / §3] The manuscript presents the six dimensions of Trust-RAG Compass (factuality, robustness, fairness, transparency, accountability, privacy) as given in the abstract and framework definition without deriving them from a systematic enumeration of RAG failure modes or comparing the partition against plausible alternatives (e.g., addition of security or calibration). Because this choice structures the entire literature review and the construction of TRC Bench, the absence of explicit justification or validation is load-bearing for the central claim.

Authors: We agree that an explicit derivation from RAG failure modes would strengthen the framework's foundation. In the revised manuscript we will expand §3 with a new subsection that first enumerates representative RAG failure modes drawn from the surveyed literature (hallucinations and retrieval errors for factuality; adversarial retrieval attacks and distribution shifts for robustness; biased retrieval results for fairness; opaque retrieval-generation pipelines for transparency; lack of audit trails for accountability; and leakage of private retrieved content for privacy). We will then map each dimension to these modes and briefly compare the resulting partition against alternatives, noting that security concerns are largely subsumed under robustness and privacy while calibration issues fall under factuality. This addition will directly support the literature organization and TRC Bench construction without changing the six dimensions themselves.

Revision: yes

Circularity Check

0 steps flagged

No circularity: framework organizes literature review without self-referential reduction

The paper proposes Trust-RAG Compass as an organizing framework for a survey of existing RAG trustworthiness literature, listing the six dimensions directly in the abstract and stating that the review proceeds 'within this framework.' No equations, fitted parameters, or self-citations are invoked to derive the dimensions; the structure is presented as a proposed taxonomy drawn from the reviewed works rather than reducing to any input by construction. The central claims (literature organization and TRC Bench) therefore remain independent of the framework definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that proposes a conceptual framework and benchmark rather than a mathematical derivation; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5805 in / 1213 out tokens · 28371 ms · 2026-05-23T21:06:17.119181+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Why Retrieval-Augmented Generation Fails: A Graph Perspective
cs.CL 2026-05 unverdicted novelty 6.0

Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.

When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making
cs.AI 2026-02 unverdicted novelty 6.0

Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.

Search-o1: Agentic Search-Enhanced Large Reasoning Models
cs.AI 2025-01 unverdicted novelty 6.0

Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...

ALDEN: Boosting Private Data Extraction from Retrieval-Augmented Generation Systems via Active Learning and Distribution Estimation
cs.IR 2026-04 unverdicted novelty 5.0

ALDEN boosts private data extraction rates from RAG systems by combining active learning for query diversification with dynamic estimation of the underlying knowledge-base topic distribution.

Reference graph

Works this paper leans on

154 extracted references · 154 canonical work pages · cited by 4 Pith papers · 9 internal anchors

[1]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023

work page 2023
[2]

Exploring the limits of transfer learning with a unified text-to- text transformer,

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P . J. Liu, “Exploring the limits of transfer learning with a unified text-to- text transformer,” J. Mach. Learn. Res. , vol. 21, pp. 140:1–140:67, 2020

work page 2020
[3]

Leveraging passage re- trieval with generative models for open domain ques- tion answering,

G. Izacard and E. Grave, “Leveraging passage re- trieval with generative models for open domain ques- tion answering,” in EACL. Association for Computa- tional Linguistics, 2021, pp. 874–880

work page 2021
[4]

WebGPT: Browser-assisted question-answering with human feedback

R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V . Kosaraju, W. Saun- ders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, and J. Schulman, “Webgpt: Browser-assisted question-answering with human feedback,” CoRR, vol. abs/2112.09332, 2021. 17

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

A multitask, multilingual, multi- modal evaluation of chatgpt on reasoning, hallucina- tion, and interactivity,

Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V . Do, Y. Xu, and P . Fung, “A multitask, multilingual, multi- modal evaluation of chatgpt on reasoning, hallucina- tion, and interactivity,” inIJCNLP (1). Association for Computational Linguistics, 2023, pp. 675–718

work page 2023
[6]

Unsupervised real-time hallucination detec- tion based on the internal states of large language models,

W. Su, C. Wang, Q. Ai, Y. Hu, Z. Wu, Y. Zhou, and Y. Liu, “Unsupervised real-time hallucination detec- tion based on the internal states of large language models,” in ACL (Findings) . Association for Com- putational Linguistics, 2024, pp. 14 379–14 391

work page 2024
[7]

Mitigating social biases of pre-trained lan- guage models via contrastive self-debiasing with dou- ble data augmentation,

Y. Li, M. Du, R. Song, X. Wang, M. Sun, and Y. Wang, “Mitigating social biases of pre-trained lan- guage models via contrastive self-debiasing with dou- ble data augmentation,” Artificial Intelligence, vol. 332, p. 104143, 2024

work page 2024
[8]

Merging generated and retrieved knowledge for open-domain QA,

Y. Zhang, M. Khalifa, L. Logeswaran, M. Lee, H. Lee, and L. Wang, “Merging generated and retrieved knowledge for open-domain QA,” in EMNLP. Asso- ciation for Computational Linguistics, 2023, pp. 4710– 4728

work page 2023
[9]

S. Pal, M. Bhattacharya, M. A. Islam, and C. Chakraborty, “Chatgpt or llm in next-generation drug discovery and development: pharmaceutical and biotechnology companies can make use of the artifi- cial intelligence-based device for a faster way of drug discovery and development,” International Journal of Surgery, vol. 109, no. 12, pp. 4382–4384, 2023

work page 2023
[10]

REPLUG: Retrieval-Augmented Black-Box Language Models

W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W. Yih, “REPLUG: retrieval-augmented black-box language models,” CoRR, vol. abs/2301.12652, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Benchmarking large language models in retrieval-augmented gener- ation,

J. Chen, H. Lin, X. Han, and L. Sun, “Benchmarking large language models in retrieval-augmented gener- ation,” in AAAI. AAAI Press, 2024, pp. 17 754–17 762

work page 2024
[12]

Revolutionizing finance with llms: An overview of applications and insights,

H. Zhao, Z. Liu, Z. Wu, Y. Li, T. Yang, P . Shu, S. Xu, H. Dai, L. Zhao, G. Maiet al., “Revolutionizing finance with llms: An overview of applications and insights,” arXiv preprint arXiv:2401.11641, 2024

work page arXiv 2024
[13]

Clipsyntel: clip and llm synergy for multimodal question summarization in healthcare,

A. Ghosh, A. Acharya, R. Jain, S. Saha, A. Chadha, and S. Sinha, “Clipsyntel: clip and llm synergy for multimodal question summarization in healthcare,” in Proceedings of the AAAI Conference on Artificial In- telligence, vol. 38, no. 20, 2024, pp. 22 031–22 039

work page 2024
[15]

Selecmix: Debiased learning by contradicting-pair sampling,

I. Hwang, S. Lee, Y. Kwak, S. J. Oh, D. Teney, J.-H. Kim, and B.-T. Zhang, “Selecmix: Debiased learning by contradicting-pair sampling,” Advances in Neural Information Processing Systems , vol. 35, pp. 14 345– 14 357, 2022

work page 2022
[16]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P . Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algo- rithms,” CoRR, vol. abs/1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[17]

Ew-tune: A framework for privately fine-tuning large language models with differential privacy,

R. Behnia, M. Ebrahimi, J. Pacheco, and B. Padmanab- han, “Ew-tune: A framework for privately fine-tuning large language models with differential privacy,” in ICDM (Workshops). IEEE, 2022, pp. 560–566

work page 2022
[18]

On robustness of prompt- based semantic parsing with large pre-trained lan- guage model: An empirical study on codex,

T. Y. Zhuo, Z. Li, Y. Huang, F. Shiri, W. Wang, G. Haffari, and Y. Li, “On robustness of prompt- based semantic parsing with large pre-trained lan- guage model: An empirical study on codex,” in EACL. Association for Computational Linguistics, 2023, pp. 1090–1102

work page 2023
[19]

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Y. Liu, Y. Yao, J. Ton, X. Zhang, R. Guo, H. Cheng, Y. Klochkov, M. F. Taufiq, and H. Li, “Trust- worthy llms: a survey and guideline for evaluat- ing large language models’ alignment,” CoRR, vol. abs/2308.05374, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

TrustLLM: Trustworthiness in Large Language Models

L. Sun, Y. Huang, H. Wang, S. Wu, Q. Zhang, C. Gao, Y. Huang, W. Lyu, Y. Zhang, X. Li, Z. Liu, Y. Liu, Y. Wang, Z. Zhang, B. Kailkhura, C. Xiong, C. Xiao, C. Li, E. P . Xing, F. Huang, H. Liu, H. Ji, H. Wang, H. Zhang, H. Yao, M. Kellis, M. Zitnik, M. Jiang, M. Bansal, J. Zou, J. Pei, J. Liu, J. Gao, J. Han, J. Zhao, J. Tang, J. Wang, J. Mitchell, K. Shu,...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Atlas: Few-shot learning with retrieval augmented language models,

G. Izacard, P . S. H. Lewis, M. Lomeli, L. Hos- seini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, “Atlas: Few-shot learning with retrieval augmented language models,” J. Mach. Learn. Res., vol. 24, pp. 251:1–251:43, 2023

work page 2023
[22]

Ragbench: Explain- able benchmark for retrieval-augmented generation systems,

R. Friel, M. Belyi, and A. Sanyal, “Ragbench: Explain- able benchmark for retrieval-augmented generation systems,” 2024

work page 2024
[23]

Rag- ex: A generic framework for explaining retrieval aug- mented generation,

V . Sudhi, S. R. Bhat, M. Rudat, and R. Teucher, “Rag- ex: A generic framework for explaining retrieval aug- mented generation,” in SIGIR. ACM, 2024, pp. 2776– 2780

work page 2024
[24]

Fairrag: Fair human generation via fair retrieval aug- mentation,

R. Shrestha, Y. Zou, Q. Chen, Z. Li, Y. Xie, and S. Deng, “Fairrag: Fair human generation via fair retrieval aug- mentation,” CoRR, vol. abs/2403.19964, 2024

work page arXiv 2024
[25]

Active retrieval augmented generation,

Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi- Yu, Y. Yang, J. Callan, and G. Neubig, “Active retrieval augmented generation,” in EMNLP. Association for Computational Linguistics, 2023, pp. 7969–7992

work page 2023
[26]

Retrieval- augmented generation for knowledge-intensive NLP tasks,

P . S. H. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K¨uttler, M. Lewis, W. Yih, T. Rockt ¨aschel, S. Riedel, and D. Kiela, “Retrieval- augmented generation for knowledge-intensive NLP tasks,” in NeurIPS, 2020

work page 2020
[27]

Retrieval augmented language model pre-training,

K. Guu, K. Lee, Z. Tung, P . Pasupat, and M. Chang, “Retrieval augmented language model pre-training,” in ICML, ser. Proceedings of Machine Learning Re- search, vol. 119. PMLR, 2020, pp. 3929–3938

work page 2020
[28]

Improving language models by retrieving from trillions of to- 18 kens,

S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J. Lespiau, B. Damoc, A. Clark, D. de Las Casas, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Pa- ganini, G. Irving, O. Vinyals, S. Osindero, K. Si- monyan, J. W. Rae, E. Elsen, and L. Sifre, “Improving lang...

work page 2022
[29]

Generalization through memoriza- tion: Nearest neighbor language models,

U. Khandelwal, O. Levy, D. Jurafsky, L. Zettlemoyer, and M. Lewis, “Generalization through memoriza- tion: Nearest neighbor language models,” in 8th In- ternational Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 . Open- Review.net, 2020

work page 2020
[30]

Chain-of- thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of- thought prompting elicits reasoning in large language models,” in NeurIPS, 2022

work page 2022
[31]

Tree of thoughts: Deliberate problem solving with large language models,

S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” in NeurIPS, 2023

work page 2023
[32]

Self- consistency improves chain of thought reasoning in language models,

X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self- consistency improves chain of thought reasoning in language models,” in ICLR. OpenReview.net, 2023

work page 2023
[33]

Large language models can be easily distracted by irrelevant context,

F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Sch¨arli, and D. Zhou, “Large language models can be easily distracted by irrelevant context,” in ICML, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 2023, pp. 31 210–31 227

work page 2023
[34]

Take a step back: Evoking reasoning via abstraction in large language models,

H. S. Zheng, S. Mishra, X. Chen, H. Cheng, E. H. Chi, Q. V . Le, and D. Zhou, “Take a step back: Evoking reasoning via abstraction in large language models,” CoRR, vol. abs/2310.06117, 2023

work page arXiv 2023
[35]

Promptagator: Few-shot dense retrieval from 8 ex- amples,

Z. Dai, V . Y. Zhao, J. Ma, Y. Luan, J. Ni, J. Lu, A. Bakalov, K. Guu, K. B. Hall, and M. Chang, “Promptagator: Few-shot dense retrieval from 8 ex- amples,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

work page 2023
[36]

Query rewrit- ing for retrieval-augmented large language models,

X. Ma, Y. Gong, P . He, H. Zhao, and N. Duan, “Query rewriting for retrieval-augmented large lan- guage models,” CoRR, vol. abs/2305.14283, 2023

work page arXiv 2023
[37]

How context af- fects language models’ factual predictions,

F. Petroni, P . S. H. Lewis, A. Piktus, T. Rockt ¨aschel, Y. Wu, A. H. Miller, and S. Riedel, “How context af- fects language models’ factual predictions,” in AKBC, 2020

work page 2020
[38]

Re2g: Retrieve, rerank, generate,

M. R. Glass, G. Rossiello, M. F. M. Chowdhury, A. Naik, P . Cai, and A. Gliozzo, “Re2g: Retrieve, rerank, generate,” in NAACL-HLT. Association for Computational Linguistics, 2022, pp. 2701–2715

work page 2022
[39]

Walking down the memory maze: Beyond context limit through interactive reading,

H. Chen, R. Pasunuru, J. Weston, and A. Celiky- ilmaz, “Walking down the memory maze: Beyond context limit through interactive reading,” CoRR, vol. abs/2310.05029, 2023

work page arXiv 2023
[40]

Sure: Summarizing retrievals using answer candidates for open-domain QA of llms,

J. Kim, J. Nam, S. Mo, J. Park, S. Lee, M. Seo, J. Ha, and J. Shin, “Sure: Summarizing retrievals using answer candidates for open-domain QA of llms,” CoRR, vol. abs/2404.13081, 2024

work page arXiv 2024
[41]

RECOMP: improving retrieval-augmented lms with context compression and selective augmentation,

F. Xu, W. Shi, and E. Choi, “RECOMP: improving retrieval-augmented lms with context compression and selective augmentation,” in ICLR. OpenRe- view.net, 2024

work page 2024
[42]

BIDER: bridg- ing knowledge inconsistency for efficient retrieval- augmented llms via key supporting evidence,

J. Jin, Y. Zhu, Y. Zhou, and Z. Dou, “BIDER: bridg- ing knowledge inconsistency for efficient retrieval- augmented llms via key supporting evidence,” CoRR, vol. abs/2402.12174, 2024

work page arXiv 2024
[43]

PRCA: fitting black-box large language models for retrieval question answering via pluggable reward-driven contextual adapter,

H. Yang, Z. Li, Y. Zhang, J. Wang, N. Cheng, M. Li, and J. Xiao, “PRCA: fitting black-box large language models for retrieval question answering via pluggable reward-driven contextual adapter,” inEMNLP. Asso- ciation for Computational Linguistics, 2023, pp. 5364– 5375

work page 2023
[44]

Self-knowledge guided retrieval augmentation for large language models,

Y. Wang, P . Li, M. Sun, and Y. Liu, “Self-knowledge guided retrieval augmentation for large language models,” in Findings of the Association for Computa- tional Linguistics: EMNLP 2023, Singapore, December 6-10, 2023 , H. Bouamor, J. Pino, and K. Bali, Eds. Association for Computational Linguistics, 2023, pp. 10 303–10 315

work page 2023
[45]

Adaptive-rag: Learning to adapt retrieval- augmented large language models through question complexity,

S. Jeong, J. Baek, S. Cho, S. J. Hwang, and J. Park, “Adaptive-rag: Learning to adapt retrieval- augmented large language models through question complexity,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, ...

work page 2024
[46]

React: Synergizing reason- ing and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reason- ing and acting in language models,” in ICLR. Open- Review.net, 2023

work page 2023
[47]

Measuring and narrowing the com- positionality gap in language models,

O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis, “Measuring and narrowing the com- positionality gap in language models,” in EMNLP (Findings). Association for Computational Linguis- tics, 2023, pp. 5687–5711

work page 2023
[48]

Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step ques- tions,

H. Trivedi, N. Balasubramanian, T. Khot, and A. Sab- harwal, “Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step ques- tions,” in ACL (1) . Association for Computational Linguistics, 2023, pp. 10 014–10 037

work page 2023
[49]

Self-rag: Learning to retrieve, generate, and critique through self-reflection,

A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” 2024

work page 2024
[50]

Toolformer: Language models can teach themselves to use tools,

T. Schick, J. Dwivedi-Yu, R. Dess `ı, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” in NeurIPS, 2023

work page 2023
[51]

The role of chatgpt in scientific communication: writing better scientific review arti- cles,

J. Huang and M. Tan, “The role of chatgpt in scientific communication: writing better scientific review arti- cles,” American journal of cancer research , vol. 13, no. 4, p. 1148, 2023

work page 2023
[52]

When llm-based code genera- tion meets the software development process,

F. Lin, D. J. Kim et al., “When llm-based code genera- tion meets the software development process,” arXiv preprint arXiv:2403.15852, 2024

work page arXiv 2024
[53]

From pre-training corpora to large lan- guage models: What factors influence llm perfor- mance in causal discovery tasks?

T. Feng, L. Qu, N. Tandon, Z. Li, X. Kang, and G. Haffari, “From pre-training corpora to large lan- guage models: What factors influence llm perfor- mance in causal discovery tasks?” arXiv preprint arXiv:2407.19638, 2024

work page arXiv 2024
[54]

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin et al. , “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” arXiv 19 preprint arXiv:2311.05232, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[55]

Llm-driven robots risk enacting discrimination, violence, and unlawful actions,

R. Azeem, A. Hundt, M. Mansouri, and M. Brand ˜ao, “Llm-driven robots risk enacting discrimination, violence, and unlawful actions,” arXiv preprint arXiv:2406.08824, 2024

work page arXiv 2024
[56]

On protecting the data privacy of large language models (llms): A survey,

B. Yan, K. Li, M. Xu, Y. Dong, Y. Zhang, Z. Ren, and X. Cheng, “On protecting the data privacy of large language models (llms): A survey,” arXiv preprint arXiv:2403.05156, 2024

work page arXiv 2024
[57]

F. Wu, N. Zhang, S. Jha, P. McDaniel, and C. Xiao, “A new era in llm security: Exploring security concerns in real-world llm-based systems,” arXiv preprint arXiv:2402.18649, 2024

[58]

H. Luo, Y. Chuang, Y. Gong, T. Zhang, Y. Kim, X. Wu, D. Fox, H. Meng, and J. R. Glass, “SAIL: search-augmented instruction learning,” CoRR, vol. abs/2305.15225, 2023

[59]

B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen, and J. Gao, “Check your facts and try again: Improving large language models with external knowledge and automated feedback,” CoRR, vol. abs/2302.12813, 2023

[60]

W. Yu, D. Iter, S. Wang, Y. Xu, M. Ju, S. Sanyal, C. Zhu, M. Zeng, and M. Jiang, “Generate rather than retrieve: Large language models are strong context generators,” in ICLR. OpenReview.net, 2023

[61]

Y. Liu, L. Huang, S. Li, S. Chen, H. Zhou, F. Meng, J. Zhou, and X. Sun, “RECALL: A benchmark for llms robustness against external counterfactual knowledge,” CoRR, vol. abs/2311.08147, 2023

[62]

Y. Pan, L. Pan, W. Chen, P. Nakov, M. Kan, and W. Y. Wang, “On the risk of misinformation pollution with large language models,” in EMNLP (Findings). Association for Computational Linguistics, 2023, pp. 1389–1403

[63]

S. Cho, S. Jeong, J. Seo, T. Hwang, and J. C. Park, “Typos that broke the rag’s back: Genetic attack on RAG pipeline by simulating documents in the wild via low-level perturbations,” CoRR, vol. abs/2404.13948, 2024

[64]

Z. Zhong, Z. Huang, A. Wettig, and D. Chen, “Poisoning retrieval corpora by injecting adversarial passages,” in EMNLP. Association for Computational Linguistics, 2023, pp. 13 764–13 775

[65]

L. Pan, W. Chen, M. Kan, and W. Y. Wang, “Attacking open-domain question answering by injecting misinformation,” in IJCNLP (1). Association for Computational Linguistics, 2023, pp. 525–539

[66]

S. Abdelnabi, K. Greshake, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection,” in AISec@CCS. ACM, 2023, pp. 79–90

[67]

O. Weller, A. Khan, N. Weir, D. J. Lawrie, and B. V. Durme, “Defending against disinformation attacks in open-domain question answering,” in EACL (2). Association for Computational Linguistics, 2024, pp. 402–417

[68]

G. Hong, J. Kim, J. Kang, S. Myaeng, and J. J. Whang, “Why so gullible? enhancing the robustness of retrieval-augmented models against counterfactual noise,” CoRR, vol. abs/2305.01579, 2023

[69]

C. Xiang, T. Wu, Z. Zhong, D. Wagner, D. Chen, and P. Mittal, “Certifiably robust rag against retrieval corruption,” arXiv preprint arXiv:2405.15556, 2024

[70]

H. Qian, Y. Zhu, Z. Dou, H. Gu, X. Zhang, Z. Liu, R. Lai, Z. Cao, J.-Y. Nie, and J.-R. Wen, “Webbrain: Learning to generate factually correct articles for queries by grounding on large web corpus,” CoRR, vol. abs/2304.04358, 2023

[71]

S. Xu, L. Pang, H. Shen, X. Cheng, and T.-S. Chua, “Search-in-the-chain: Towards the accurate, credible and traceable content generation for complex knowledge-intensive tasks,” CoRR, vol. abs/2304.14732, 2023

[72]

X. Li, C. Zhu, L. Li, Z. Yin, T. Sun, and X. Qiu, “Llatrieval: Llm-verified retrieval for verifiable generation,” CoRR, vol. abs/2311.07838, 2023

[73]

X. Ye, R. Sun, S. Ö. Arik, and T. Pfister, “Effective large language model adaptation for improved grounding,” CoRR, vol. abs/2311.09533, 2023

[74]

Y. Fang, S. W. Thomas, and X. Zhu, “HGOT: hierarchical graph of thoughts for retrieval-augmented in-context learning in factuality evaluation,” CoRR, vol. abs/2402.09390, 2024

[75]

S. Xia, X. Wang, J. Liang, Y. Zhang, W. Zhou, J. Deng, F. Yu, and Y. Xiao, “Ground every sentence: Improving retrieval-augmented llms with interleaved reference-claim generation,” arXiv preprint arXiv:2407.01796, 2024

[76]

A. Chen, P. Pasupat, S. Singh, H. Lee, and K. Guu, “PURR: efficiently editing language model hallucinations by denoising language model corruptions,” CoRR, vol. abs/2305.14908, 2023

[77]

W. Li, J. Li, W. Ma, and Y. Liu, “Citation-enhanced generation for llm-based chatbots,” CoRR, vol. abs/2402.16063, 2024

[78]

S. Huo, N. Arabzadeh, and C. Clarke, “Retrieving supporting evidence for generative question answering,” in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 2023, pp. 11–20

[79]

W. Zou, R. Geng, B. Wang, and J. Jia, “Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models,” CoRR, vol. abs/2402.07867, 2024

[80]

H. Chaudhari, G. Severi, J. Abascal, M. Jagielski, C. A. Choquette-Choo, M. Nasr, C. Nita-Rotaru, and A. Oprea, “Phantom: General trigger attacks on retrieval augmented language generation,” arXiv preprint arXiv:2405.20485, 2024

[81]

D. Pasquini, M. Strohmeier, and C. Troncoso, “Neural exec: Learning (and learning from) execution triggers for prompt injection attacks,” CoRR, vol. abs/2403.03792, 2024
