How Do LLMs Cite? A Mechanistic Interpretation of Attribution in Retrieval-Augmented Generation

Ian van Dort (University of Amsterdam); Maria Heuss (University of Amsterdam)

arXiv: 2606.28358 · v1 · pith:NFUF4BBEnew · submitted 2026-06-09 · 💻 cs.IR · cs.AI · cs.CL

Pith reviewed 2026-06-30 11:09 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL

keywords retrieval augmented generation · mechanistic interpretability · activation patching · inline citations · large language models · attribution · factoid questions

The pith

LLMs rely on a distributed ensemble of attention heads and MLP layers to decide on inline citations in RAG outputs, allowing targeted interventions to correct most citation mistakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the internal mechanism by which large language models choose to include inline citations when answering questions using retrieved documents. Through activation patching experiments on Llama-3.1-8B-Instruct with the PopQA dataset, it identifies a set of critical components that control citation behavior. The authors demonstrate that editing these components can restore over 90% of missed citations and remove 69% of incorrect ones while preserving answer quality. This matters because it shows that citation decisions are mechanistically separate from the reasoning process itself, potentially undermining the trustworthiness that citations are meant to provide. The findings hold directionally on the more complex HotpotQA benchmark as well.

Core claim

The central claim is that citation attribution in retrieval-augmented generation emerges from a distributed, multi-stage attributional ensemble of attention heads and MLP layers rather than any single localized component. By using activation patching to map this ensemble in Llama-3.1-8B-Instruct on PopQA, the work shows that selectively amplifying or attenuating these components repairs over 90% of missed citations and eliminates 69% of spurious ones without harming answer accuracy, and produces similar directional effects on HotpotQA.
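The denoising form of activation patching named in the core claim can be sketched on a toy model. This is a minimal illustration, not the authors' code: the component names, weights, and input features below are all invented, and a real experiment would hook the activations of Llama-3.1-8B-Instruct rather than call small functions. Each component's clean activation is restored into a corrupted run, and the score is the fraction of the citation-logit gap that the single restoration recovers.

```python
from typing import Callable, Dict, Optional, Tuple

Inputs = Dict[str, float]

# Hypothetical components: each adds a contribution to one "citation logit".
# The names and weights are invented for illustration only.
COMPONENTS: Dict[str, Callable[[Inputs], float]] = {
    "head_L10H3": lambda x: 2.0 * x["doc_supports_answer"],
    "head_L14H7": lambda x: 1.5 * x["doc_supports_answer"],
    "mlp_L20": lambda x: 1.0 * x["doc_supports_answer"] + 0.5 * x["cite_instruction"],
    "head_L02H1": lambda x: 0.3 * x["cite_instruction"],
}
BIAS = -3.0


def run(inputs: Inputs, patch: Optional[Dict[str, float]] = None) -> Tuple[float, Dict[str, float]]:
    """Return (citation logit, per-component activations), applying any patch."""
    acts = {name: fn(inputs) for name, fn in COMPONENTS.items()}
    if patch:
        acts.update(patch)  # overwrite selected activations with patched values
    return BIAS + sum(acts.values()), acts


def denoising_scores(clean: Inputs, corrupted: Inputs) -> Dict[str, float]:
    """Fraction of the clean-vs-corrupted logit gap recovered by each component."""
    clean_logit, clean_acts = run(clean)
    corrupted_logit, _ = run(corrupted)
    gap = clean_logit - corrupted_logit
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same logit")
    scores = {}
    for name in COMPONENTS:
        # Run on the corrupted input, restoring only this component's clean activation.
        patched_logit, _ = run(corrupted, patch={name: clean_acts[name]})
        scores[name] = (patched_logit - corrupted_logit) / gap
    return scores


# Clean prompt: the document supports the answer. Corrupted prompt: it does not.
clean = {"doc_supports_answer": 1.0, "cite_instruction": 1.0}
corrupted = {"doc_supports_answer": 0.0, "cite_instruction": 1.0}
scores = denoising_scores(clean, corrupted)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

In this toy the recovery is spread over three components by construction, which mirrors the shape of the "distributed ensemble" claim without being evidence for it.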

What carries the argument

An attributional ensemble consisting of multiple attention heads and MLP layers that collectively govern the decision to attach an inline citation to an answer.

If this is right

The mechanism for citation is distributed rather than localized in one part of the model.
Targeted editing of these components can substantially improve citation faithfulness in RAG systems.
The same components influence citation behavior across different datasets like PopQA and HotpotQA.
Apparent citation use may not reflect the model's actual internal attribution process.
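The "targeted editing" idea can be sketched as scaling the contributions of an identified component set. The sketch below assumes a toy additive citation logit; the component names, activation values, bias, and scale factors are hypothetical and are not taken from the paper. Amplifying the ensemble flips a missed citation to emitted, and attenuating it suppresses a spurious one.

```python
from typing import Dict, Iterable, Optional

# Hypothetical ensemble membership, invented for illustration.
ENSEMBLE = ["head_L10H3", "head_L14H7", "mlp_L20"]
BIAS = -3.0


def scale_ensemble(ensemble: Iterable[str], factor: float) -> Dict[str, float]:
    """Build a per-component scale map that touches only the ensemble."""
    return {name: factor for name in ensemble}


def citation_logit(acts: Dict[str, float], scale: Optional[Dict[str, float]] = None) -> float:
    """Toy logit: bias plus (optionally scaled) component contributions."""
    scale = scale or {}
    return BIAS + sum(scale.get(name, 1.0) * value for name, value in acts.items())


def cites(acts: Dict[str, float], scale: Optional[Dict[str, float]] = None) -> bool:
    """The toy model emits a citation when the logit is positive."""
    return citation_logit(acts, scale) > 0.0


# Hypothetical activations for a missed citation (source supports the answer,
# but the unedited model stays just below the decision boundary).
missed = {"head_L10H3": 1.0, "head_L14H7": 0.8, "mlp_L20": 0.7, "head_L02H1": 0.3}
# Hypothetical activations for a spurious citation (just above the boundary).
spurious = {"head_L10H3": 1.4, "head_L14H7": 1.0, "mlp_L20": 0.6, "head_L02H1": 0.3}

print("missed, unedited:", cites(missed))
print("missed, amplified x1.5:", cites(missed, scale_ensemble(ENSEMBLE, 1.5)))
print("spurious, unedited:", cites(spurious))
print("spurious, attenuated x0.5:", cites(spurious, scale_ensemble(ENSEMBLE, 0.5)))
```

Components outside the ensemble keep a scale of 1.0, which is what makes the edit targeted; whether such an edit leaves answer accuracy intact is an empirical question the toy cannot address.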

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Post-hoc editing of the ensemble could be used to enhance citation accuracy in deployed models.
Similar mechanistic approaches might reveal how LLMs handle other forms of attribution or source grounding.
The disconnect implies that users should not rely solely on inline citations for verifying model outputs.

Load-bearing premise

Activation patching on the identified heads and MLPs directly isolates the causal drivers of citation decisions rather than just changing generation behavior in unrelated ways.

What would settle it

A new experiment where patching the same components on a different model or dataset fails to correct citation errors while leaving answer accuracy unchanged would show that the ensemble is not the general causal mechanism.

Figures

Figures reproduced from arXiv: 2606.28358 by Ian van Dort (University of Amsterdam), Maria Heuss (University of Amsterdam).

**Figure 1.** Residual stream (denoising). Core Structural Pattern: A Distributed Attributional Ensemble. We find that the decision to cite is governed not by a single “citation head” but by a distributed and fragile attributional ensemble of many attention heads and MLPs, where “fragile” means that small corruptions at many sites strongly reduce citation, ...

**Figure 2.** Key mechanistic signals: (a) token/layer regions whose clean MLP acti...

**Figure 3.** Visualization of mechanistic processes that contribute to the citation gen...

**Figure 4.** Targeted PopQA repairs via scaling identified components.

**Figure 5.** Generalization of identified components from PopQA to HotpotQA. Left: ...

Original abstract

Retrieval-Augmented Generation (RAG) aims to enhance the trustworthiness of Large Language Models (LLMs) by grounding their outputs in external documents, often using inline citations for verifiability. However, the faithfulness of these citations -- whether the model genuinely uses a source to generate an answer -- remains a critical, unverified assumption. This paper offers the first mechanistic account of how a large language model decides whether to attach an inline citation while answering a factoid question. Using the Llama-3.1-8B-Instruct model in a controlled experimental environment based on the PopQA dataset, we employ an activation patching approach. We map the underlying mechanism responsible for citation, discovering that it is not a single, localized component but a distributed, multi-stage "attributional ensemble" of attention heads and MLP layers. We show that amplifying or attenuating only those critical heads and MLPs repairs over 90% of missed citations and eliminates 69% of spurious ones on PopQA without harming answer accuracy. Although gains on the multi-document HotpotQA benchmark are modest, the same component set still moves citation rates in the intended direction, indicating that the underlying mechanism is not dataset-specific. The results reveal a potential disconnect between the model's apparent reasoning and its internal computational pathway, suggesting that inline citations can create a false sense of security.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Patching locates a distributed set of heads and MLPs that shift citation rates in Llama-3.1 RAG, with large effects on PopQA and smaller ones on HotpotQA.

The main point is that citation decisions here are handled by a spread-out collection of attention heads and MLPs rather than any single module, and intervening on just those components recovers most missed citations and removes many spurious ones on PopQA without dropping answer accuracy.

The paper's new step is to apply activation patching to map the internal pathway for inline citations rather than stopping at accuracy measurements. The reported numbers are concrete: over 90 percent repair of missed citations and 69 percent elimination of spurious ones on the primary dataset, plus directional movement when the same components are tested on HotpotQA. That second dataset is a reasonable guard against overfitting to one benchmark.

The soft spots are the missing error bars, the lack of detail on how the critical heads were chosen, and the absence of an ablation of the patching protocol itself. More critically, the interventions change citation rates, but it is not yet shown that they specifically target the computation that checks whether the generated answer is supported by a retrieved source; they could instead be affecting a generic citation-token policy or downstream formatting. Preserving answer accuracy narrows the alternatives but does not close them.

This is for people working on mechanistic interpretability of RAG systems. A reader who wants concrete components to edit for better citation behavior will find usable results here. The experimental specificity is enough to merit peer review rather than a desk reject.

Referee Report

3 major / 2 minor

Summary. The paper claims to deliver the first mechanistic account of inline citation decisions in RAG by applying activation patching to Llama-3.1-8B-Instruct on PopQA. It identifies a distributed 'attributional ensemble' of attention heads and MLPs whose amplification or attenuation repairs >90% of missed citations and eliminates 69% of spurious ones while preserving answer accuracy; the same components produce directional improvements on HotpotQA, suggesting the mechanism is not dataset-specific and revealing a potential disconnect between surface citations and internal attribution.

Significance. If the interventions isolate attribution computation rather than generic citation formatting, the work would be significant for mechanistic interpretability of RAG faithfulness. It supplies concrete, cross-dataset intervention results on a public model and highlights that citation behavior can be edited without accuracy loss, which could inform targeted reliability improvements. The distributed 'ensemble' finding challenges assumptions of localized citation circuitry.

major comments (3)

[Abstract, §4] Abstract and §4 (PopQA results): the reported 90% repair and 69% elimination rates are presented without error bars, without the precise head-selection procedure, and without an ablation of the patching protocol itself. These quantities are load-bearing for the central claim yet lack the statistical and methodological detail needed for verification.
[§3, §5] §3 (activation-patching protocol) and §5 (discussion of mechanism): no experiment holds generated answer content fixed while varying only source-grounding evidence. Preservation of answer accuracy therefore does not rule out effects on a generic 'emit citation token' policy or downstream formatting circuitry rather than on attribution verification.
[§4.3] §4.3 (HotpotQA transfer): the claim that the component set 'moves citation rates in the intended direction' and is 'not dataset-specific' rests on modest, directional changes without reported effect sizes, statistical tests, or a quantitative comparison to PopQA. This weakens the generalization argument.
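The error bars requested in the first major comment could take several forms. One option, sketched here with only the standard library, is a percentile bootstrap interval around a repair rate. The outcome list is made up for illustration (40 repaired out of 50) and does not reproduce any number from the paper; the function names are likewise hypothetical.

```python
import random
from typing import List, Tuple


def rate(outcomes: List[bool]) -> float:
    """Fraction of cases where the intervention had the intended effect."""
    return sum(outcomes) / len(outcomes)


def bootstrap_ci(
    outcomes: List[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the rate."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(
        rate([outcomes[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return low, high


# Hypothetical data: True means a previously missed citation was restored.
repaired = [True] * 40 + [False] * 10
point = rate(repaired)
low, high = bootstrap_ci(repaired)
print(f"repair rate {point:.2f}, 95% bootstrap CI [{low:.2f}, {high:.2f}]")
```

Resampling over examples captures sampling noise in the evaluation set only; variation across random seeds or prompts would need its own outer loop.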

minor comments (2)

[§3] Notation for the 'attributional ensemble' is introduced without a formal definition or pseudocode; a concise algorithmic description would improve reproducibility.
[Figures 3-5] Figure captions and axis labels in the patching results should explicitly state the number of runs and whether shaded regions represent standard error or standard deviation.
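The kind of algorithmic description the first minor comment asks for might look like the sketch below: select components whose patching score clears a threshold, then compare their combined effect against random component sets of the same size. The paper's actual selection criteria are not stated in this review, so the threshold, the scores, the component names, and the additive treatment of effects are all assumptions made for illustration.

```python
import random
from typing import Dict, List


def select_ensemble(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep components whose patching score is at least the threshold."""
    return sorted(name for name, score in scores.items() if score >= threshold)


def total_effect(scores: Dict[str, float], components: List[str]) -> float:
    """Sum of per-component scores (assumes effects combine additively)."""
    return sum(scores[name] for name in components)


def random_baseline(scores: Dict[str, float], k: int, n_draws: int = 1000, seed: int = 0) -> float:
    """Mean total effect of randomly chosen component sets of size k."""
    rng = random.Random(seed)
    names = sorted(scores)
    draws = [total_effect(scores, rng.sample(names, k)) for _ in range(n_draws)]
    return sum(draws) / n_draws


# Hypothetical patching scores for eight components.
scores = {
    "head_a": 0.30, "head_b": 0.25, "mlp_c": 0.20, "head_d": 0.05,
    "head_e": 0.04, "mlp_f": 0.03, "head_g": 0.02, "mlp_h": 0.01,
}
ensemble = select_ensemble(scores, threshold=0.15)
baseline = random_baseline(scores, k=len(ensemble))
print("ensemble:", ensemble)
print(f"ensemble effect {total_effect(scores, ensemble):.2f} vs random mean {baseline:.2f}")
```

A selected set that does not clearly beat the random baseline would suggest the identified components are not special, which is the point of the ablation the referee requests.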

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating planned revisions to the manuscript where appropriate.

Referee: [Abstract, §4] Abstract and §4 (PopQA results): the reported 90% repair and 69% elimination rates are presented without error bars, without the precise head-selection procedure, and without an ablation of the patching protocol itself. These quantities are load-bearing for the central claim yet lack the statistical and methodological detail needed for verification.

Authors: We agree that the reported rates require additional statistical and methodological detail for full verifiability. In the revised manuscript we will add error bars computed across multiple random seeds to the 90% and 69% figures in both the abstract and §4. We will also expand the description of the head- and layer-selection procedure with explicit metrics, thresholds, and selection criteria. Finally, we will include an ablation comparing the full patching protocol against random-component and layer-only baselines. These additions will be incorporated into §4.

revision: yes

Referee: [§3, §5] §3 (activation-patching protocol) and §5 (discussion of mechanism): no experiment holds generated answer content fixed while varying only source-grounding evidence. Preservation of answer accuracy therefore does not rule out effects on a generic 'emit citation token' policy or downstream formatting circuitry rather than on attribution verification.

Authors: The activation-patching protocol holds the input prompt (question plus retrieved documents) fixed and intervenes only on internal activations of the identified components. This design isolates changes to attribution computation while the evidence remains constant. We will revise §3 to emphasize this isolation and expand §5 to discuss why a fully fixed-answer-content experiment is difficult in autoregressive generation, where citation decisions are entangled with content production. We maintain that the component-specific nature of the interventions supports attribution rather than generic formatting, but we acknowledge the referee's point on the limits of the current controls.

revision: partial

Referee: [§4.3] §4.3 (HotpotQA transfer): the claim that the component set 'moves citation rates in the intended direction' and is 'not dataset-specific' rests on modest, directional changes without reported effect sizes, statistical tests, or a quantitative comparison to PopQA. This weakens the generalization argument.

Authors: We agree that quantitative details would strengthen the generalization claim. In the revision we will report effect sizes for the observed citation-rate changes on HotpotQA, include statistical tests (e.g., paired t-tests with p-values), and add a direct quantitative comparison (percentage-point deltas and confidence intervals) between the PopQA and HotpotQA results. These additions will appear in §4.3.

revision: yes
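A paired significance test of the kind promised for §4.3 can be sketched as follows. To stay within the standard library, this uses a sign-flip permutation test on paired differences as a stand-in for the paired t-test the authors mention; it is a different test with a similar purpose. The per-run citation rates are invented for illustration and are not results from the paper.

```python
import random
from typing import List, Tuple


def paired_sign_flip_test(
    before: List[float],
    after: List[float],
    n_perm: int = 5000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean paired difference, two-sided Monte Carlo p-value)."""
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [a - b for a, b in zip(after, before)]
    mean_diff = sum(diffs) / len(diffs)
    observed = abs(mean_diff)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return mean_diff, (extreme + 1) / (n_perm + 1)


# Hypothetical citation rates for eight paired runs, without and with the edit.
before = [0.40, 0.42, 0.38, 0.41, 0.39, 0.43, 0.40, 0.37]
after = [0.45, 0.46, 0.41, 0.47, 0.42, 0.48, 0.44, 0.40]
mean_diff, p_value = paired_sign_flip_test(before, after)
print(f"mean change {mean_diff:+.3f}, p = {p_value:.4f}")
```

The mean difference doubles as the percentage-point delta the authors propose to report; the p-value only says whether a change of that size is distinguishable from noise across the paired runs.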

Circularity Check

0 steps flagged

No significant circularity; experimental results rely on external benchmarks and interventions

The paper reports an activation-patching study on Llama-3.1-8B-Instruct using the external PopQA and HotpotQA benchmarks. Critical heads and MLPs are identified via patching experiments and then intervened upon to measure changes in citation rates while preserving answer accuracy. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The derivation chain consists of empirical measurements on held-out data rather than any reduction of outputs to the inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that activation patching cleanly isolates the citation mechanism and that the PopQA and HotpotQA results generalize beyond the tested model and prompts. No free parameters are described in the abstract; the one invented entity is the "attributional ensemble" label itself.

axioms (1)

domain assumption: Activation patching on attention heads and MLPs isolates the causal pathway for citation decisions rather than downstream generation effects.
Invoked when the authors interpret patching results as revealing the attribution mechanism.

invented entities (1)

attributional ensemble (no independent evidence)
purpose: Label for the distributed set of heads and MLPs responsible for citation decisions.
Introduced to describe the non-localized mechanism; no independent evidence supplied beyond the patching results.

pith-pipeline@v0.9.1-grok · 5781 in / 1399 out tokens · 24758 ms · 2026-06-30T11:09:37.021332+00:00 · methodology

[1] [1]

https://doi.org/10.48550/ arXiv.2311.01463, http://arxiv.org/abs/2311.01463, arXiv:2311.01463 [cs]

Ahmad, M.A., Yaramis, I., Roy, T.D.: Creating Trustworthy LLMs: Dealing with Hallucinations in Healthcare AI (Sep 2023). https://doi.org/10.48550/ arXiv.2311.01463, http://arxiv.org/abs/2311.01463, arXiv:2311.01463 [cs]

work page arXiv 2023

[2] [2]

https://doi.org/10.48550/ARXIV.2212.08037, https: //arxiv.org/abs/2212.08037, publisher: arXiv Version Number: 2

Bohnet, B., Tran, V.Q., Verga, P., Aharoni, R., Andor, D., Soares, L.B., Ciaramita, M., Eisenstein, J., Ganchev, K., Herzig, J., Hui, K., Kwiatkowski, T., Ma, J., Ni, J., Saralegui, L.S., Schuster, T., Cohen, W.W., Collins, M., Das, D., Metzler, D., Petrov, S., Webster, K.: Attributed Question Answering: Evaluation and Modeling for Attributed Large Lan- g...

work page doi:10.48550/arxiv.2212.08037 2022

[3] [3]

Reasoning Models Don't Always Say What They Think

Chen,Y.,Benton,J.,Radhakrishnan,A.,Uesato,J.,Denison,C.,Schulman, J., Somani, A., Hase, P., Wagner, M., Roger, F., Mikulik, V., Bowman, S.R., Leike, J., Kaplan, J., Perez, E.: Reasoning Models Don’t Always Say What They Think (May 2025). https://doi.org/10.48550/arXiv.2505.05410, http://arxiv.org/abs/2505.05410, arXiv:2505.05410 [cs]

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.05410 2025

[4] [4]

https://doi.org/10.48550/ arXiv.2409.00729, http://arxiv.org/abs/2409.00729, arXiv:2409.00729 [cs]

Cohen-Wang, B., Shah, H., Georgiev, K., Madry, A.: ContextCite: Attribut- ing Model Generation to Context (Sep 2024). https://doi.org/10.48550/ arXiv.2409.00729, http://arxiv.org/abs/2409.00729, arXiv:2409.00729 [cs]

work page arXiv 2024

[5] [5]

Gao,T.,Yen,H.,Yu,J.,Chen,D.:EnablingLargeLanguageModelstoGen- erateTextwithCitations.In:Bouamor,H.,Pino,J.,Bali,K.(eds.)Proceed- ingsofthe2023ConferenceonEmpiricalMethodsinNaturalLanguagePro- cessing. pp. 6465–6488. Association for Computational Linguistics, Singa- pore (Dec 2023). https://doi.org/10.18653/v1/2023.emnlp-main.398, https: //aclanthology.org...

work page doi:10.18653/v1/2023.emnlp-main.398 2023

[6] [6]

Retrieval-Augmented Generation for Large Language Models: A Survey

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., Wang, H.: Retrieval-Augmented Generation for Large Language Models: A Survey (Mar 2024). https://doi.org/10.48550/arXiv.2312.10997, http://arxiv.org/abs/2312.10997, arXiv:2312.10997 version: 5

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.10997 2024

[7] [7]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A.,Hinsvark,A.,Rao,A.,Zhang,A.,Rodriguez,A.,Gregerson,A.,Spataru, A., Roziere, B., Biron, B., Tang, B., Chern, B., Caucheteux, C., Nayak...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024

[8] [8]

Jacovi, A., Goldberg, Y.: Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 4198–4205. Association for Computational Linguistics, Online (Jul 2020). https://doi.or...

[9] [9]

Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al.: ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274 (2023)

[10] [10]

Lee, M., Cimino, J., Zhu, H.R., Sable, C., Shanker, V., Ely, J., Yu, H.: Beyond Information Retrieval—Medical Question Answering. AMIA Annual Symposium Proceedings 2006, 469–473 (2006), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839371/

[11] [11]

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In: Advances in Neural Information Processing Systems. vol. 33, pp. 9459–9474. Curran Associates, Inc. (2020), https://proceedings.neurips.cc/ ...

[12] [12]

Li, R., Chen, C., Hu, Y., Gao, Y., Wang, X., Yilmaz, E.: Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation (May 2025). https://doi.org/10.48550/arXiv.2505.16415, http://arxiv.org/abs/2505.16415, arXiv:2505.16415 [cs]

[13] [13]

Liu, N.F., Zhang, T., Liang, P.: Evaluating Verifiability in Generative Search Engines (Oct 2023). https://doi.org/10.48550/arXiv.2304.09848, http://arxiv.org/abs/2304.09848, arXiv:2304.09848 [cs]

[14] [15]

Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C.D., Ho, D.E.: Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (May 2024). https://doi.org/10.48550/arXiv.2405.20362, http://arxiv.org/abs/2405.20362, arXiv:2405.20362 [cs]

[15] [16]

Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., Hajishirzi, H.: When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 9802–98...

work page doi:10.18653/v1/2023.acl-long.546 2023

[16] [17]

Merken, S.: New York lawyers sanctioned for using fake ChatGPT cases in legal brief. Reuters (Jun 2023), https://www.reuters.com/legal/new-york-lawyers-sanctioned-using-fake-chatgpt-cases-legal-brief-2023-06-22/

[17] [18]

Miller, T., Howe, P., Sonenberg, L.: Explainable AI: beware of inmates running the asylum or: How I learnt to stop worrying and love the social and behavioural sciences. CoRR abs/1712.00547 (2017), http://arxiv.org/abs/1712.00547

[18] [19]

Modi, N.D., Menz, B.D., Awaty, A.A., Alex, C.A., Logan, J.M., McKinnon, R.A., Rowland, A., Bacchi, S., Gradon, K., Sorich, M.J., Hopkins, A.M.: Assessing the System-Instruction Vulnerabilities of Large Language Models to Malicious Conversion Into Health Disinformation Chatbots. Annals of Internal Medicine (Jun 2025). https://doi.org/10.7326/ANNALS-24-03...

work page doi:10.7326/annals-24-03933 2025

[19] [20]

Muller, B., Wieting, J., Clark, J., Kwiatkowski, T., Ruder, S., Soares, L., Aharoni, R., Herzig, J., Wang, X.: Evaluating and Modeling Attribution for Cross-Lingual Question Answering. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 144–157. Association for...

work page doi:10.18653/v1/2023.emnlp-main.10 2023

[20] [21]

Nanda, N., Bloom, J.: TransformerLens (2022), https://github.com/TransformerLensOrg/TransformerLens

[21] [22]

Phukan, A., Somasundaram, S., Saxena, A., Goswami, K., Srinivasan, B.V.: Peering into the Mind of Language Models: An Approach for Attribution in Contextual Question Answering (May 2024). https://doi.org/10.48550/arXiv.2405.17980, http://arxiv.org/abs/2405.17980, arXiv:2405.17980 [cs]

[22] [23]

Qi, J., Sarti, G., Fernández, R., Bisazza, A.: Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 6037–6053 (2024). https://doi.org/10.18653/v1/2024.emnlp-main.347, http://arxiv.org/abs/2406.13663, arXiv:2406.13663 [cs]

[23] [24]

Sadeghi, M., Pöttgen, D., Ebel, P., Vogelsang, A.: Explaining the unexplainable: The impact of misleading explanations on trust in unreliable predictions for hardly assessable tasks. In: Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization. pp. 36–46 (2024)

[24] [25]

Seabrooke, T., Schneiders, E., Dowthwaite, L., Krook, J., Leesakul, N., Clos, J., Maior, H., Fischer, J.: A survey of lay people’s willingness to generate legal advice using large language models (LLMs). In: Proceedings of the Second International Symposium on Trustworthy Autonomous Systems. pp. 1–5 (2024)

[25] [26]

Tonmoy, S.M.T.I., Zaman, S.M.M., Jain, V., Rani, A., Rawte, V., Chadha, A., Das, A.: A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models (Jan 2024). https://doi.org/10.48550/arXiv.2401.01313, http://arxiv.org/abs/2401.01313, arXiv:2401.01313 [cs]

[26] [27]

Turpin, M., Michael, J., Perez, E., Bowman, S.: Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. Advances in Neural Information Processing Systems 36, 74952–74965 (Dec 2023), https://proceedings.neurips.cc/paper_files/paper/2023/hash/ed3fea9033a80fea1376299fa7863f4a-Abstract-Conference.html

[27] [28]

Wallat, J., Heuss, M., Rijke, M.d., Anand, A.: Correctness is not Faithfulness in RAG Attributions (Dec 2024). https://doi.org/10.48550/arXiv.2412.18004, http://arxiv.org/abs/2412.18004, arXiv:2412.18004 [cs]

[28] [29]

Wang, K., Variengien, A., Conmy, A., Shlegeris, B., Steinhardt, J.: Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (Nov 2022). https://doi.org/10.48550/arXiv.2211.00593, http://arxiv.org/abs/2211.00593, arXiv:2211.00593 [cs]

[29] [30]

Winninger, T., Addad, B., Kapusta, K.: Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models (May 2025). https://doi.org/10.48550/arXiv.2503.06269, http://arxiv.org/abs/2503.06269, arXiv:2503.06269 [cs]

[30] [31]

Yang, R., Tan, T.F., Lu, W., Thirunavukarasu, A.J., Ting, D.S.W., Liu, N.: Large language models in health care: Development, applications, and challenges. Health Care Science 2(4), 255–263 (2023)

[31] [32]

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W.W., Salakhutdinov, R., Manning, C.D.: HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering (Sep 2018). https://doi.org/10.48550/arXiv.1809.09600, http://arxiv.org/abs/1809.09600, arXiv:1809.09600 [cs]