Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis and Interpretation

Changjiang Gao; Min Zhang; Peng Hu; Renfei Dang; Shujian Huang; Zhejian Lai

arXiv: 2511.02626 · v3 · submitted 2025-11-04 · 💻 cs.CL

Renfei Dang, Peng Hu, Zhejian Lai, Changjiang Gao, Min Zhang, Shujian Huang

Pith reviewed 2026-05-18 01:20 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM hallucinations · new knowledge fine-tuning · attention mechanisms · factual errors · interpretability analysis · knowledge types · Biography-Reasoning dataset


The pith

Fine-tuning on new knowledge weakens LLMs' attention to key question entities and raises factual hallucination risk.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors create a Biography-Reasoning dataset to track how fine-tuning LLMs on fresh facts triggers hallucinations on both new and previously known information. They test across knowledge question answering and reasoning tasks and find that hallucinations spread beyond the new material. The key driver turns out to be the degree of unfamiliarity within one knowledge category rather than the sheer volume of new data. Interpretability checks show that new-knowledge training reduces focus on important entities in the input question, causing the model to lean too heavily on surrounding text. Reintroducing a modest amount of already-known facts in later training stages reverses the attention shift and lowers hallucination rates, while the disrupted patterns can carry over to similar wording in other contexts.

Core claim

When LLMs are fine-tuned on datasets where a knowledge type consists entirely of new facts, attention to the main entities in the input question drops. This leads the model to rely more on surrounding context and produce more factual errors even on information it previously handled correctly. The effect appears across both QA and reasoning tasks and extends to evaluation items outside the new knowledge. Reintroducing a small share of known facts during the final training phase brings attention back to the key entities and cuts the hallucination rate. The same attention disruption spreads to lexically similar contexts, allowing hallucinations to propagate beyond the original task.
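The late-stage reintroduction of known facts described above (the figure captions call it KnownPatch) can be pictured as a data-ordering step. The sketch below is a minimal illustration, not the authors' procedure: the function name `late_known_schedule`, the `tail_fraction` knob, and the reading of the injection ratio as "known examples relative to new examples" are all assumptions made for this example.

```python
import random
from typing import List, Sequence


def late_known_schedule(
    new_examples: Sequence[str],
    known_examples: Sequence[str],
    injection_ratio: float = 0.2,
    tail_fraction: float = 0.25,
    seed: int = 0,
) -> List[str]:
    """Order training examples so known facts appear only in the final stretch.

    injection_ratio: known examples added, as a fraction of the new examples
        (an assumed definition; the source does not spell it out here).
    tail_fraction: share of the new examples that make up the final stretch
        (a hypothetical knob introduced for this sketch).
    """
    if not 0.0 <= injection_ratio <= 1.0:
        raise ValueError("injection_ratio must be in [0, 1]")
    if not 0.0 < tail_fraction <= 1.0:
        raise ValueError("tail_fraction must be in (0, 1]")
    rng = random.Random(seed)
    n_known = min(len(known_examples), round(injection_ratio * len(new_examples)))
    patch = rng.sample(list(known_examples), n_known)
    n_tail = round(tail_fraction * len(new_examples))
    split = len(new_examples) - n_tail
    head = list(new_examples[:split])  # early training: new facts only
    tail = list(new_examples[split:]) + patch
    rng.shuffle(tail)  # mix known facts into the last stretch of new ones
    return head + tail


# Hypothetical example identifiers, used only to show the resulting order.
new = [f"new-{i}" for i in range(20)]
known = [f"known-{i}" for i in range(50)]
schedule = late_known_schedule(new, known)
print(len(schedule), schedule[-9:])
```

With 20 new examples and the default settings, the first 15 positions hold new facts only and the 4 sampled known facts are shuffled into the final 9 positions.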

What carries the argument

Attention weight shifts to key entities in the input question, tracked via interpretability analysis during fine-tuning on new versus mixed knowledge.
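One way to make "attention to key entities" concrete is to measure what share of a query position's attention mass lands on the entity tokens. The sketch below assumes that definition; the page does not state the paper's exact attention score, so the function names and the averaging over rows (heads or layers) are illustrative choices, and the numbers are toy values.

```python
from typing import List, Sequence


def entity_attention_share(
    attention_row: Sequence[float], entity_positions: Sequence[int]
) -> float:
    """Fraction of one query position's attention mass that lands on entity tokens.

    attention_row: weights from a single query position (for example the last
        prompt token) to every input token; assumed non-negative.
    entity_positions: indices of the tokens that spell the key entity.
    """
    total = sum(attention_row)
    if total <= 0:
        raise ValueError("attention row must have positive total mass")
    on_entity = sum(attention_row[i] for i in set(entity_positions))
    return on_entity / total


def mean_entity_share(
    rows: List[Sequence[float]], entity_positions: Sequence[int]
) -> float:
    """Average the entity share over several heads or layers (assumed aggregation)."""
    return sum(entity_attention_share(r, entity_positions) for r in rows) / len(rows)


# Toy attention rows over 5 input tokens; the entity occupies positions 1 and 2.
before = [[0.1, 0.4, 0.3, 0.1, 0.1], [0.2, 0.3, 0.3, 0.1, 0.1]]
after = [[0.3, 0.1, 0.1, 0.3, 0.2], [0.3, 0.2, 0.1, 0.2, 0.2]]
entity = [1, 2]

share_before = mean_entity_share(before, entity)
share_after = mean_entity_share(after, entity)
print(f"before fine-tuning: {share_before:.2f}")
print(f"after fine-tuning:  {share_after:.2f}")
```

In a real setting the rows would come from the model's attention tensors at a chosen layer and query position; which layers and heads to use is a design decision this sketch leaves open.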

If this is right

Hallucinations appear on tasks with new knowledge and also spread to other evaluation tasks.
Unfamiliarity concentrated in one knowledge type produces stronger hallucination effects than a uniform mix of new knowledge.
Disrupted attention patterns transfer to lexically similar contexts and extend hallucination behavior.
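The distinction between per-type unfamiliarity and overall new-knowledge proportion is easy to blur, so a small worked example may help. The sketch below builds two toy training mixes with the same overall share of new facts, one concentrating them in a single knowledge type and one spreading them evenly. The knowledge-type names and the helper functions are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Example = Tuple[str, bool]  # (knowledge_type, is_new)


def new_fraction_by_type(examples: Iterable[Example]) -> Dict[str, float]:
    """Share of new-knowledge examples within each knowledge type."""
    counts = defaultdict(lambda: [0, 0])  # type -> [new count, total count]
    for ktype, is_new in examples:
        counts[ktype][0] += int(is_new)
        counts[ktype][1] += 1
    return {k: new / total for k, (new, total) in counts.items()}


def overall_new_fraction(examples: Iterable[Example]) -> float:
    """Share of new-knowledge examples in the whole training mix."""
    items: List[Example] = list(examples)
    return sum(is_new for _, is_new in items) / len(items)


# Hypothetical knowledge types, four examples each.
types = ["birthplace", "occupation", "university", "employer"]

# Mix A: one type is entirely new, the others are entirely known.
concentrated = [("birthplace", True)] * 4 + [
    (t, False) for t in types[1:] for _ in range(4)
]
# Mix B: the same number of new facts, spread evenly over all types.
uniform = [(t, i == 0) for t in types for i in range(4)]

print(overall_new_fraction(concentrated), new_fraction_by_type(concentrated))
print(overall_new_fraction(uniform), new_fraction_by_type(uniform))
```

Both mixes are 25% new overall, but only the first contains a knowledge type that is 100% new, which is the condition the paper associates with stronger hallucination effects.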

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Interleaving small amounts of known facts throughout training could serve as a practical way to limit attention shifts.
The same attention-reduction pattern might appear when fine-tuning on new material in domains such as code or mathematics.
Monitoring attention to key entities during training could become a simple early-warning signal for rising hallucination risk.
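The early-warning idea in the last item could be as simple as comparing an entity-attention score at each checkpoint against a pre-fine-tuning baseline. The sketch below is speculative in the same way the list is: the function name, the 30% relative-drop threshold, and the checkpoint values are all made up for illustration.

```python
from typing import Optional, Sequence


def first_alert_step(
    shares: Sequence[float], baseline: float, max_relative_drop: float = 0.3
) -> Optional[int]:
    """Return the first checkpoint index whose score falls too far below baseline.

    shares: entity-attention score measured at successive checkpoints.
    baseline: the same score measured before fine-tuning.
    max_relative_drop: tolerated relative decrease (a hypothetical threshold).
    """
    floor = baseline * (1.0 - max_relative_drop)
    for step, share in enumerate(shares):
        if share < floor:
            return step
    return None


# Made-up checkpoint scores that drift downward during fine-tuning.
checkpoint_shares = [0.58, 0.55, 0.47, 0.40, 0.33]
alert = first_alert_step(checkpoint_shares, baseline=0.60)
print("first alert at checkpoint:", alert)
```

A sensible threshold would have to be calibrated against held-out hallucination rates, which this sketch does not attempt.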

Load-bearing premise

The Biography-Reasoning dataset and its knowledge-type partitions isolate the effect of new knowledge without confounding shifts in overall data distribution or task difficulty.

What would settle it

Measuring no drop in attention weights on key entities after new-knowledge fine-tuning, or observing that reintroducing known knowledge fails to reduce hallucinations on familiar facts.

Figures

Figures reproduced from arXiv: 2511.02626 by Changjiang Gao, Min Zhang, Peng Hu, Renfei Dang, Shujian Huang, Zhejian Lai.

**Figure 1.** The impact of learning new knowledge on attention patterns and hallucination behavior. Training a model …

**Figure 2.** Performance under two settings with different …

**Figure 4.** Attention score on the key entity name across …

**Figure 5.** Accuracy and attention score changes with …

**Figure 6.** Accuracy and attention score changes when …

**Figure 7.** Performance and attention score changes un…

**Figure 8.** Performance with different proportions of …

**Figure 9.** Accuracy and attention score changes with …

**Figure 10.** Performance and attention score changes under Shuffled and KnownPatch (with 20% injection ratio) settings. QA represents the average across the four QA test sets, and error bars indicate standard deviations. … and order of few shots can significantly affect the model’s performance (Lu et al., 2022; Zhao et al., 2021), and we need to rule out this influence. The results are shown in Tables 14, 15, 16 and 17. …

**Figure 12.** KnownPatch on reasoning tasks with 5% injection ratio. All experiments trained for 3 epochs.

**Figure 13.** KnownPatch on reasoning tasks with 10% injection ratio. All experiments trained for 3 epochs.

**Figure 14.** KnownPatch on reasoning tasks with 20% injection ratio. All experiments trained for 3 epochs.

**Figure 15.** Performance and attention score changes when learning new knowledge in QA tasks, and after applying KnownPatch (with 10% known data). QA represents the average across the four QA test sets, and error bars indicate standard deviations.

**Figure 16.** Performance and attention score changes when learning new knowledge in QA tasks, and after applying KnownPatch (with 5% known data). QA represents the average across the four QA test sets, and error bars indicate standard deviations.

**Figure 18.** Llama-3.2-1B model’s attention score on the …

**Figure 19.** Llama-3.2-1B model’s performance and at…

**Figure 17.** The impact of learning new knowledge in reasoning tasks on the average performance across different groups (on the Llama-3.2-1B model). We also perform the same interpretability analysis as Section 5 and Appendix D on the Llama-3.2-1B model. Based on the results of …

**Figure 20.** The impact of learning new knowledge in reasoning tasks on the average performance of different groups (on the Qwen3-8B-Base model).

**Figure 23.** The impact of learning new knowledge in reasoning tasks on the average performance of different groups (on the Qwen2.5-32B model). We perform the same analysis as Section 5 and Appendix D on the Qwen2.5-32B model. Based on the results of …

**Figure 24.** Qwen2.5-32B model’s attention score on the …

**Figure 25.** Qwen2.5-32B model’s performance and attention score changes when learning new knowledge in QA tasks, and after applying KnownPatch (with 20% known data). QA represents the average across the four QA test sets, and error bars indicate standard deviations. … from known knowledge in the training set, and about 50% accuracy on those constructed from unknown knowledge. When training is extended to 20 epochs, t…

**Figure 27.** The impact of learning new knowledge in reasoning tasks on the average performance of different groups. All experiments trained for 1 epoch.

**Figure 30.** Accuracy and attention score changes when …

**Figure 28.** Performance of KnownPatch on the reasoning task when injecting 20% known data. The value here represents the accuracy percentage of this model compared to the fully known baseline model. All experiments trained for 1 epoch.

**Figure 31.** Accuracy and attention score changes with …

**Figure 29.** KnownPatch (missing one knowledge type) on QA tasks with an injection ratio of 20%. All experiments trained for 1 epoch. … STQA DTQA Wiki -53.46 (± 6.68) -1.79 (± 2.05) -13.75 (± 10.60)

**Figure 33.** Performance in QA tasks under two settings …

**Figure 34.** The impact of learning new knowledge in reasoning tasks on the average performance of different groups. All experiments trained for 5 epochs. …reports the accuracy and attention score changes after learning different proportions of unknown knowledge; …

**Figure 36.** KnownPatch (missing one knowledge type) on QA tasks with an injection ratio of 20%. All experiments trained for 5 epochs.

**Figure 37.** Accuracy and attention score changes when …

**Figure 41.** The impact of learning new knowledge in reasoning tasks on the average performance of different groups. All experiments trained for 20 epochs.

**Figure 39.** Performance and attention score changes when learning new knowledge in QA tasks, and after applying KnownPatch (with 20% known data). QA represents the average across the four QA test sets, and error bars indicate standard deviations. All experiments trained for 5 epochs.

**Figure 40.** Performance in QA tasks under two settings …

**Figure 44.** Accuracy and attention score changes when …

**Figure 45.** Accuracy and attention score changes with …

**Figure 46.** Performance and attention score changes when learning new knowledge in QA tasks, and after applying KnownPatch (with 20% known data). QA represents the average across the four QA test sets, and error bars indicate standard deviations. All experiments trained for 20 epochs.

Original abstract

Prior works have shown that fine-tuning on new knowledge can induce factual hallucinations in large language models (LLMs), leading to incorrect outputs when evaluated on previously known information. However, the specific manifestations of such hallucination and its underlying mechanisms remain insufficiently understood. Our work addresses this gap by designing a controlled dataset, Biography-Reasoning, and conducting a fine-grained analysis across multiple knowledge types and two task types, including knowledge question answering (QA) and knowledge reasoning tasks. We find that hallucinations not only severely affect tasks involving newly introduced knowledge, but also propagate to other evaluation tasks. Moreover, when fine-tuning on a dataset in which a specific knowledge type consists entirely of new knowledge, LLMs exhibit elevated hallucination tendencies. This suggests that the degree of unfamiliarity within a particular knowledge type, rather than the overall proportion of new knowledge, is a stronger driver of hallucinations. Through interpretability analysis, we show that learning new knowledge weakens the model's attention to key entities in the input question, leading to an over-reliance on surrounding context and a higher risk of hallucination. Conversely, reintroducing a small amount of known knowledge during the later stages of training restores attention to key entities and substantially mitigates hallucination behavior. Finally, we demonstrate that disrupted attention patterns can propagate across lexically similar contexts, facilitating the spread of hallucinations beyond the original task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines how fine-tuning LLMs on new knowledge induces factual hallucinations on previously known facts. Using a custom Biography-Reasoning dataset, the authors perform fine-grained experiments across knowledge-type partitions and two task types (knowledge QA and reasoning). They report that hallucinations affect new-knowledge tasks and propagate to other evaluations; that the degree of unfamiliarity within a specific knowledge type is a stronger driver than overall new-knowledge proportion; that new-knowledge training weakens attention to key input entities (increasing context reliance); and that reintroducing small amounts of known knowledge later in training restores attention and reduces hallucinations. Disrupted attention patterns are also shown to propagate across lexically similar contexts.

Significance. If the mechanistic account holds, the work supplies a concrete, testable link between new-knowledge fine-tuning, attention redistribution, and hallucination propagation, together with a practical mitigation strategy. The controlled dataset construction and cross-task, cross-type analysis are positive features; the attention-based interpretability results, if quantitatively supported, would constitute a falsifiable prediction about training dynamics that could inform future alignment and continual-learning methods.

major comments (2)

[§3] §3 (Biography-Reasoning Dataset): The claim that the dataset isolates the effect of novelty rests on controlled construction and knowledge-type partitions, yet the manuscript provides no explicit verification that lexical overlap, entity frequency, or reasoning depth are balanced across new-knowledge versus known-knowledge splits. Without these controls, observed attention shifts and hallucination rates could be driven by distributional differences rather than unfamiliarity per se; this is load-bearing for the central causal interpretation.
[§5] §5 (Interpretability Analysis): The key mechanistic claim—that learning new knowledge weakens attention to key entities—is supported only by qualitative attention-map observations. No quantitative effect sizes, average attention scores with standard errors, or statistical tests comparing conditions are reported, nor are ablation controls shown to rule out confounding changes in token distribution or task difficulty. This weakens the evidential basis for the attention-weakening account and the subsequent mitigation result.

minor comments (2)

[Abstract and §4] The abstract and results sections would benefit from explicit reporting of effect sizes, confidence intervals, and the number of runs for all hallucination-rate comparisons.
[§3 and §4] Notation for knowledge-type partitions (e.g., “entirely new knowledge” vs. mixed) should be defined once in a table or equation and used consistently to avoid ambiguity when discussing propagation effects.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and insightful comments on our manuscript. We address each major comment point-by-point below. Where the concerns identify opportunities to strengthen the evidential basis, we will revise the manuscript accordingly.

Referee: [§3] §3 (Biography-Reasoning Dataset): The claim that the dataset isolates the effect of novelty rests on controlled construction and knowledge-type partitions, yet the manuscript provides no explicit verification that lexical overlap, entity frequency, or reasoning depth are balanced across new-knowledge versus known-knowledge splits. Without these controls, observed attention shifts and hallucination rates could be driven by distributional differences rather than unfamiliarity per se; this is load-bearing for the central causal interpretation.

Authors: We thank the referee for this observation. The Biography-Reasoning dataset was constructed by partitioning knowledge types and selecting entities and reasoning chains to isolate novelty effects, with explicit efforts to maintain comparable entity selection and reasoning structures across splits. Nevertheless, we agree that making balance explicit would better support the causal interpretation. In the revised manuscript we will add quantitative verification, including n-gram overlap statistics for lexical similarity, histograms of entity frequencies, and counts of reasoning steps or dependency depth, all compared across new-knowledge and known-knowledge partitions. These additions will be placed in a new subsection of §3 and the corresponding appendix. revision: yes
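The n-gram overlap statistic promised in this response could take many forms. The sketch below shows one simple option, a Jaccard overlap between the n-gram sets of two splits, using whitespace tokenization. The function names and the example questions are hypothetical, and nothing here is claimed to match what the authors would actually report.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def split_ngram_jaccard(
    split_a: Iterable[str], split_b: Iterable[str], n: int = 2
) -> float:
    """Jaccard overlap between the n-gram vocabularies of two data splits."""
    grams_a = set().union(*(ngrams(t, n) for t in split_a))
    grams_b = set().union(*(ngrams(t, n) for t in split_b))
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


# Hypothetical questions: identical templates, different (made-up) entities.
known_split = [
    "Where was Alice Smith born ?",
    "Which university did Alice Smith attend ?",
]
new_split = [
    "Where was Bob Jones born ?",
    "Which university did Bob Jones attend ?",
]

overlap = split_ngram_jaccard(known_split, new_split, n=2)
print(f"bigram Jaccard overlap: {overlap:.3f}")
```

Here the shared bigrams come entirely from the question templates, which is the kind of lexical similarity a balance check would want to quantify separately from entity overlap.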
Referee: [§5] §5 (Interpretability Analysis): The key mechanistic claim—that learning new knowledge weakens attention to key entities—is supported only by qualitative attention-map observations. No quantitative effect sizes, average attention scores with standard errors, or statistical tests comparing conditions are reported, nor are ablation controls shown to rule out confounding changes in token distribution or task difficulty. This weakens the evidential basis for the attention-weakening account and the subsequent mitigation result.

Authors: The referee correctly identifies that the current interpretability results are presented qualitatively. To address this, the revised manuscript will augment §5 with quantitative support: we will report mean attention weights on key entities (with standard errors) for models trained on new-knowledge versus known-knowledge conditions, together with statistical comparisons (paired t-tests or non-parametric equivalents). We will also add ablation experiments that hold token distributions and task difficulty constant while varying only the novelty of the fine-tuning data. These quantitative results and controls will be included to provide a stronger, falsifiable basis for the attention-weakening mechanism and the mitigation findings. revision: yes
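The paired comparison mentioned in this response can be sketched with the standard library alone. The example below computes a paired t statistic and, as one possible non-parametric equivalent, an exact two-sided sign-flip permutation p-value. The attention values are invented purely for illustration and are not results from the paper; the function names are likewise assumptions.

```python
import itertools
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


def sign_flip_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on the summed paired difference.

    Enumerates all 2**n sign assignments, so it only suits small n.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


# Invented per-item entity-attention scores for the same six test items.
known_only = [0.62, 0.58, 0.65, 0.60, 0.63, 0.59]
new_knowledge = [0.41, 0.45, 0.40, 0.47, 0.39, 0.44]

t_stat, dof = paired_t_statistic(known_only, new_knowledge)
p_perm = sign_flip_p_value(known_only, new_knowledge)
print(f"t = {t_stat:.2f} (df = {dof}), sign-flip p = {p_perm:.5f}")
```

Turning the t statistic into a p-value needs the t distribution, which a statistics library would supply; the permutation test avoids that dependency at the cost of exponential enumeration.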

Circularity Check

0 steps flagged

No significant circularity detected


The paper conducts an empirical analysis on a controlled Biography-Reasoning dataset with fine-grained partitions across knowledge types and tasks, reporting direct observations of attention map changes and hallucination rates. The central mechanism (new knowledge weakening attention to key entities, mitigated by reintroducing known knowledge) is presented as a measured outcome rather than a definitional or fitted result that presupposes itself. No equations, self-citations, or ansatzes are shown to reduce the reported findings to inputs by construction, and the work remains self-contained against external benchmarks without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on two standard assumptions from the LLM interpretability literature: that attention weights are meaningful proxies for entity focus, and that the constructed dataset faithfully separates new from known knowledge.

axioms (2)

Domain assumption: Attention weights in transformer layers reflect the model's focus on input entities during generation.
Invoked when linking weakened attention to key entities with increased hallucination risk.

Domain assumption: The Biography-Reasoning dataset partitions knowledge types without introducing unintended distributional shifts.
Required for attributing hallucination differences to unfamiliarity within a knowledge type rather than other factors.



Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 3 internal anchors

[1] Allen-Zhu, Z.; and Li, Y. 2024. Physics of Language Models: Part 3.1, Knowledge Storage and Extraction. In Forty-first International Conference on Machine Learning.

[2] Allen-Zhu, Z.; and Li, Y. 2025. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. In Proceedings of the 13th International Conference on Learning Representations, ICLR '25. Full version available at https://ssrn.com/abstract=5250617

[3] Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; and Hajishirzi, H. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In The Twelfth International Conference on Learning Representations.

[4] Cohen, R.; Geva, M.; Berant, J.; and Globerson, A. 2023. Crawling the internal knowledge-base of language models. arXiv preprint arXiv:2301.12810.

[5] Duwal, S. 2025. MKA: Leveraging Cross-Lingual Consensus for Model Abstention. arXiv preprint arXiv:2503.23687.

[6] Feng, S.; Shi, W.; Bai, Y.; Balachandran, V.; He, T.; and Tsvetkov, Y. 2023. Knowledge card: Filling llms' knowledge gaps with plug-in specialized language models. arXiv preprint arXiv:2305.09955.

[7] Gekhman, Z.; Yona, G.; Aharoni, R.; Eyal, M.; Feder, A.; Reichart, R.; and Herzig, J. 2024. Does fine-tuning llms on new knowledge encourage hallucinations? arXiv preprint arXiv:2405.05904.

[8] Ghosal, G.; Hashimoto, T.; and Raghunathan, A. 2024. Understanding finetuning for factual knowledge extraction. arXiv preprint arXiv:2406.14785.

[9] Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783.

[10] Gu, Y.; Zhang, W.; Lyu, C.; Lin, D.; and Chen, K. 2025. Mask-dpo: Generalizable fine-grained factuality alignment of llms. arXiv preprint arXiv:2503.02846.

[11] Kang, K.; Wallace, E.; Tomlin, C.; Kumar, A.; and Levine, S. 2024. Unfamiliar finetuning examples control how language models hallucinate. arXiv preprint arXiv:2403.05612.

[12] Li, J.; and Ng, H. T. 2025. The Hallucination Dilemma: Factuality-Aware Reinforcement Learning for Large Reasoning Models. arXiv preprint arXiv:2505.24630.

[13] Lin, S.-C.; Gao, L.; Oguz, B.; Xiong, W.; Lin, J.; Yih, W.-t.; and Chen, X. 2024. Flame: Factuality-aware alignment for large language models. Advances in Neural Information Processing Systems, 37: 115588–115614.

[14] Lin, X. V.; Chen, X.; Chen, M.; Shi, W.; Lomeli, M.; James, R.; Rodriguez, P.; Kahn, J.; Szilvasy, G.; Lewis, M.; et al. 2023. Ra-dit: Retrieval-augmented dual instruction tuning. In The Twelfth International Conference on Learning Representations.

[15] Liu, Y.; Chang, S.; Jaakkola, T.; and Zhang, Y. 2024. Fictitious synthetic data can improve llm factuality via prerequisite learning. arXiv preprint arXiv:2410.19290.

[16] Lu, Y.; Bartolo, M.; Moore, A.; Riedel, S.; and Stenetorp, P. 2022. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Muresan, S.; Nakov, P.; and Villavicencio, A., eds., Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 8086–8098. Dublin, ...

[17] Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730–27744.

[18] Ovadia, O.; Brief, M.; Mishaeli, M.; and Elisha, O. 2023. Fine-tuning or retrieval? comparing knowledge injection in llms. arXiv preprint arXiv:2312.05934.

[19] Petroni, F.; Rocktäschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A. H.; and Riedel, S. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.

[20] Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C. D.; Ermon, S.; and Finn, C. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36: 53728–53741.

[21] Sciavolino, C.; Zhong, Z.; Lee, J.; and Chen, D. 2021. Simple Entity-Centric Questions Challenge Dense Retrievers. In Moens, M.-F.; Huang, X.; Specia, L.; and Yih, S. W.-t., eds., Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 6138–6148. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.

[22] Shuster, K.; Poff, S.; Chen, M.; Kiela, D.; and Weston, J. 2021. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567.

[23] Sun, C.; Aksitov, R.; Zhmoginov, A.; Miller, N. A.; Vladymyrov, M.; Rueckert, U.; Kim, B.; and Sandler, M. 2025. How new data permeates LLM knowledge and how to dilute it. arXiv preprint arXiv:2504.09522.

[24] Sun, Z.; Wang, X.; Tay, Y.; Yang, Y.; and Zhou, D. 2022. Recitation-augmented language models. arXiv preprint arXiv:2210.01296.

[25] Team, Q. 2024. Qwen2.5: A Party of Foundation Models.

[26] Team, Q. 2025. Qwen3 Technical Report. arXiv:2505.09388.

[27] Vrandečić, D.; and Krötzsch, M. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10): 78–85.

[28] Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 24824–24837.

[29] Wendler, C.; Veselovsky, V.; Monea, G.; and West, R. 2024. Do Llamas Work in English? On the Latent Language of Multilingual Transformers. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 15366–15394. Bangkok, Thailand: Association for Comp...

[30] Yadkori, Y. A.; Kuzborskij, I.; Stutz, D.; György, A.; Fisch, A.; Doucet, A.; Beloshapka, I.; Weng, W.-H.; Yang, Y.-Y.; Szepesvári, C.; et al. 2024. Mitigating llm hallucinations via conformal abstention. arXiv preprint arXiv:2405.01563.

[31] Zhao, Y.; Zhang, W.; Chen, G.; Kawaguchi, K.; and Bing, L. 2024. How do Large Language Models Handle Multilingualism? In Advances in Neural Information Processing Systems (NeurIPS).

[32] Zhao, Z.; Wallace, E.; Feng, S.; Klein, D.; and Singh, S. 2021. Calibrate Before Use: Improving Few-shot Performance of Language Models. In Meila, M.; and Zhang, T., eds., Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 12697–12706. PMLR.

[33] Zheng, J.; Cai, X.; Qiu, S.; and Ma, Q. 2025. Spurious Forgetting in Continual Learning of Language Models. In The Thirteenth International Conference on Learning Representations.

[34] Zheng, Y.; Zhang, R.; Zhang, J.; Ye, Y.; Luo, Z.; Feng, Z.; and Ma, Y. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Bangkok, Thailand: Association for Computational Linguistics.

[35] Zhu, R.; Jiang, Z.; Wu, J.; Ma, Z.; Song, J.; Bai, F.; Lin, D.; Wu, L.; and He, C. 2025. GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation. arXiv preprint arXiv:2502.05911.

[1] [1]

Allen-Zhu, Z.; and Li, Y. 2024. Physics of Language Models: Part 3.1, Knowledge Storage and Extraction. In Forty-first International Conference on Machine Learning

work page 2024

[2] [2]

Allen-Zhu , Z.; and Li, Y. 2025. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws . In Proceedings of the 13th International Conference on Learning Representations, ICLR '25. Full version available at https://ssrn.com/abstract=5250617

work page 2025

[3] [3]

Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; and Hajishirzi, H. 2024. Self- RAG : Learning to Retrieve, Generate, and Critique through Self-Reflection. In The Twelfth International Conference on Learning Representations

work page 2024

[4] [4]

Cohen, R.; Geva, M.; Berant, J.; and Globerson, A. 2023. Crawling the internal knowledge-base of language models. arXiv preprint arXiv:2301.12810

work page arXiv 2023

[5] [5]

Duwal, S. 2025. MKA: Leveraging Cross-Lingual Consensus for Model Abstention. arXiv preprint arXiv:2503.23687

work page arXiv 2025

[6] [6]

Feng, S.; Shi, W.; Bai, Y.; Balachandran, V.; He, T.; and Tsvetkov, Y. 2023. Knowledge card: Filling llms' knowledge gaps with plug-in specialized language models. arXiv preprint arXiv:2305.09955

work page arXiv 2023

[7] [7]

Gekhman, Z.; Yona, G.; Aharoni, R.; Eyal, M.; Feder, A.; Reichart, R.; and Herzig, J. 2024. Does fine-tuning llms on new knowledge encourage hallucinations? arXiv preprint arXiv:2405.05904

work page arXiv 2024

[8] [8]

Ghosal, G.; Hashimoto, T.; and Raghunathan, A. 2024. Understanding finetuning for factual knowledge extraction. arXiv preprint arXiv:2406.14785

work page arXiv 2024

[9] [9]

Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Gu, Y.; Zhang, W.; Lyu, C.; Lin, D.; and Chen, K. 2025. Mask-dpo: Generalizable fine-grained factuality alignment of llms. arXiv preprint arXiv:2503.02846

work page arXiv 2025

[11] [11]

Kang, K.; Wallace, E.; Tomlin, C.; Kumar, A.; and Levine, S. 2024. Unfamiliar finetuning examples control how language models hallucinate. arXiv preprint arXiv:2403.05612

work page arXiv 2024

[12] [12]

Li, J.; and Ng, H. T. 2025. The Hallucination Dilemma: Factuality-Aware Reinforcement Learning for Large Reasoning Models. arXiv preprint arXiv:2505.24630

[13]

Lin, S.-C.; Gao, L.; Oguz, B.; Xiong, W.; Lin, J.; Yih, W.-t.; and Chen, X. 2024. Flame: Factuality-aware alignment for large language models. Advances in Neural Information Processing Systems, 37: 115588--115614

[14]

Lin, X. V.; Chen, X.; Chen, M.; Shi, W.; Lomeli, M.; James, R.; Rodriguez, P.; Kahn, J.; Szilvasy, G.; Lewis, M.; et al. 2023. Ra-dit: Retrieval-augmented dual instruction tuning. In The Twelfth International Conference on Learning Representations

[15]

Liu, Y.; Chang, S.; Jaakkola, T.; and Zhang, Y. 2024. Fictitious synthetic data can improve llm factuality via prerequisite learning. arXiv preprint arXiv:2410.19290

[16]

Lu, Y.; Bartolo, M.; Moore, A.; Riedel, S.; and Stenetorp, P. 2022. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Muresan, S.; Nakov, P.; and Villavicencio, A., eds., Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 8086--8098. Dublin, ...

[17]

Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730--27744

[18]

Ovadia, O.; Brief, M.; Mishaeli, M.; and Elisha, O. 2023. Fine-tuning or retrieval? comparing knowledge injection in llms. arXiv preprint arXiv:2312.05934

[19]

Petroni, F.; Rocktäschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A. H.; and Riedel, S. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066

[20]

Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C. D.; Ermon, S.; and Finn, C. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36: 53728--53741

[21]

Sciavolino, C.; Zhong, Z.; Lee, J.; and Chen, D. 2021. Simple Entity-Centric Questions Challenge Dense Retrievers. In Moens, M.-F.; Huang, X.; Specia, L.; and Yih, S. W.-t., eds., Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 6138--6148. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics

[22]

Shuster, K.; Poff, S.; Chen, M.; Kiela, D.; and Weston, J. 2021. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567

[23]

Sun, C.; Aksitov, R.; Zhmoginov, A.; Miller, N. A.; Vladymyrov, M.; Rueckert, U.; Kim, B.; and Sandler, M. 2025. How new data permeates LLM knowledge and how to dilute it. arXiv preprint arXiv:2504.09522

[24]

Sun, Z.; Wang, X.; Tay, Y.; Yang, Y.; and Zhou, D. 2022. Recitation-augmented language models. arXiv preprint arXiv:2210.01296

[25]

Team, Q. 2024. Qwen2.5: A Party of Foundation Models

[26]

Team, Q. 2025. Qwen3 Technical Report. arXiv:2505.09388

[27]

Vrandečić, D.; and Krötzsch, M. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10): 78–85

[28]

Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 24824--24837

[29]

Wendler, C.; Veselovsky, V.; Monea, G.; and West, R. 2024. Do Llamas Work in English? On the Latent Language of Multilingual Transformers. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 15366--15394. Bangkok, Thailand: Association for Comp...

[30]

Yadkori, Y. A.; Kuzborskij, I.; Stutz, D.; György, A.; Fisch, A.; Doucet, A.; Beloshapka, I.; Weng, W.-H.; Yang, Y.-Y.; Szepesvári, C.; et al. 2024. Mitigating llm hallucinations via conformal abstention. arXiv preprint arXiv:2405.01563

[31]

Zhao, Y.; Zhang, W.; Chen, G.; Kawaguchi, K.; and Bing, L. 2024. How do Large Language Models Handle Multilingualism? In Advances in Neural Information Processing Systems (NeurIPS)

[32]

Zhao, Z.; Wallace, E.; Feng, S.; Klein, D.; and Singh, S. 2021. Calibrate Before Use: Improving Few-shot Performance of Language Models. In Meila, M.; and Zhang, T., eds., Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 12697--12706. PMLR

[33]

Zheng, J.; Cai, X.; Qiu, S.; and Ma, Q. 2025. Spurious Forgetting in Continual Learning of Language Models. In The Thirteenth International Conference on Learning Representations

[34]

Zheng, Y.; Zhang, R.; Zhang, J.; Ye, Y.; Luo, Z.; Feng, Z.; and Ma, Y. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Bangkok, Thailand: Association for Computational Linguistics

[35]

Zhu, R.; Jiang, Z.; Wu, J.; Ma, Z.; Song, J.; Bai, F.; Lin, D.; Wu, L.; and He, C. 2025. GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation. arXiv preprint arXiv:2502.05911
