arXiv: 2506.17585 · v3 · submitted 2025-06-21 · 💻 cs.AI · cs.CL · cs.LG

Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models

Pith reviewed 2026-05-19 07:36 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords retrieval-free citation · knowledge attribution · continual pretraining · active indexing · document identifiers · synthetic data augmentation · bidirectional training · CitePretrainBench

The pith

LLMs can learn reliable citations to their own pretraining documents without any external retrieval at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that revising continual pretraining to create persistent bindings between facts and document identifiers allows models to attribute their answers directly to sources seen during training. This matters because it removes the need for test-time retrieval, cutting latency, infrastructure costs, and exposure to retrieval errors while still producing verifiable outputs. The key technique augments training data with diverse restatements of each fact plus bidirectional source-to-fact and fact-to-source examples, producing more robust bindings than simply tagging documents with identifiers. Experiments on a new benchmark mixing Wikipedia, Common Crawl, arXiv, and novel documents confirm gains of up to 30.2 percent citation precision on both single-fact and multi-fact tasks, with further gains as the volume of augmented data increases.

Core claim

Active Indexing during continual pretraining binds factual knowledge to persistent document identifiers by training on synthetic augmentations that restate each fact in diverse compositional forms and enforce bidirectional mappings between sources and facts. After subsequent instruction tuning, the resulting models generate content from cited sources and attribute their own answers with higher precision than a passive baseline that merely appends identifiers, with the advantage holding across short-form and long-form citation tasks and scaling as augmented data volume grows.

What carries the argument

Active Indexing, which augments pretraining data with compositional restatements and bidirectional source-to-fact training to create generalizable bindings between facts and document identifiers.
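
The contrast between Passive and Active Indexing can be illustrated with a minimal sketch. The identifier format (`<doc:...>`), the two prompt templates, and the names `Document`, `passive_example`, and `active_examples` are assumptions made for illustration, not the paper's implementation. The restatements are written by hand here, whereas the paper generates them synthetically.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    doc_id: str                # persistent identifier (format is hypothetical)
    text: str                  # original document text
    restatements: List[str]    # diverse restatements of facts in the document


def passive_example(doc: Document) -> str:
    """Passive Indexing baseline: simply append the identifier to the document."""
    return f"{doc.text} <doc:{doc.doc_id}>"


def active_examples(doc: Document) -> List[str]:
    """Active Indexing sketch: two training strings per restated fact."""
    examples = []
    for fact in doc.restatements:
        # Source-to-fact: the identifier comes first, the content follows.
        examples.append(f"According to <doc:{doc.doc_id}>: {fact}")
        # Fact-to-source: the content comes first, the identifier follows.
        examples.append(f"{fact} Source: <doc:{doc.doc_id}>")
    return examples


# Toy document, invented for this example.
doc = Document(
    doc_id="wiki-000123",
    text="The Eiffel Tower was completed in 1889.",
    restatements=[
        "Construction of the Eiffel Tower finished in 1889.",
        "In 1889, the Eiffel Tower was completed.",
    ],
)
passive = passive_example(doc)
active = active_examples(doc)
```

In this sketch the passive baseline yields one training string per document, while the active variant yields two per restatement, so the identifier appears both as a prompt and as a target.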

If this is right

  • Citation precision continues to rise as the amount of augmented synthetic data scales to at least 16 times the original token count.
  • Internal citations improve robustness when the model is later given noisy external retrieval results.
  • The same binding approach supports both single-fact short answers and multi-fact long-form generation.
  • The method works across model sizes tested, including 3B and 7B Qwen-2.5 variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Removing the external retriever could simplify deployment of citation systems in resource-constrained environments.
  • Tying outputs to specific training documents may offer a route to audit or edit model knowledge by editing or removing source documents.
  • The bidirectional training pattern could be adapted to other attribution tasks such as tracing reasoning steps back to training examples.

Load-bearing premise

The synthetic data augmentations will create bindings that generalize to real user queries rather than only matching the synthetic distribution.

What would settle it

Citation precision on a held-out set of natural user queries falls below the synthetic benchmark results by more than the gap seen between active and passive indexing.
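
The test above can be written as a small decision rule. This is a sketch under stated assumptions: citation precision is taken here to be the fraction of emitted citations that point at a gold source document (the page does not give the paper's exact metric definition), and the function names and all numbers are hypothetical.

```python
from typing import Iterable, List, Set


def citation_precision(cited: Iterable[List[str]], gold: Iterable[Set[str]]) -> float:
    """Fraction of emitted citations that point at a gold source (assumed definition)."""
    total = 0
    correct = 0
    for cited_ids, gold_ids in zip(cited, gold):
        total += len(cited_ids)
        correct += sum(1 for doc_id in cited_ids if doc_id in gold_ids)
    return correct / total if total else 0.0


def claim_is_undermined(active_synthetic: float,
                        passive_synthetic: float,
                        active_natural: float) -> bool:
    """True when the drop on natural queries exceeds the active-vs-passive gap."""
    drop = active_synthetic - active_natural
    gap = active_synthetic - passive_synthetic
    return drop > gap


# Two answers: the first cites d1 and d2 (only d1 is gold), the second cites d3 (gold).
precision = citation_precision([["d1", "d2"], ["d3"]], [{"d1"}, {"d3"}])

# Hypothetical precision values, chosen only to exercise both outcomes.
small_drop = claim_is_undermined(0.6, 0.3, 0.5)
large_drop = claim_is_undermined(0.6, 0.3, 0.2)
```

The rule is equivalent to asking whether Active Indexing on natural queries scores below Passive Indexing on the benchmark.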

Figures

Figures reproduced from arXiv: 2506.17585 by Bhuwan Dhingra, Jian Pei, Manzil Zaheer, Sanxing Chen, Yukun Huang.

Figure 1: CitePretrain Framework. We construct a diverse corpus comprising Wikipedia, ArXiv, …
Figure 2: Scaling Curve of Combining Backward and Forward on RepliQA.
Diverse Fact Representations Help Citation. Active Indexing generates diverse fact variants (through paraphrasing, composition, and interaction), all tied to the same document ID. This diversity helps the model generalize and reliably cite, improving both memorization and utilization. To evaluate how diversity impacts citation ability, we study how c…
Figure 3: Scaling Comparison Between Active Indexing and Passive Indexing on RepliQA
Original abstract

Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B&3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs can be trained to reliably attribute citations to documents encountered during continual pretraining without test-time retrieval. It introduces CitePretrainBench (mixing real corpora such as Wikipedia, Common Crawl, and arXiv with novel documents) for short-form and long-form citation tasks, and proposes a two-stage method: Active Indexing via continual pretraining that augments data with diverse compositional restatements plus bidirectional (source-to-fact and fact-to-source) pairs to create source-anchored bindings, followed by instruction tuning. Experiments on Qwen-2.5-7B and 3B models show Active Indexing outperforming a Passive Indexing baseline (simple identifier appending) with citation precision gains up to 30.2%, and an upward performance trend as augmented data scales to 16x the original token count.

Significance. If the central empirical claims hold and generalize, the work offers a promising direction for retrieval-free citation in LLMs, which could reduce latency, infrastructure costs, and vulnerability to retrieval noise while complementing external retrieval. The scaling ablation showing continued gains with more augmented data is a clear strength that supports the method's viability. The introduction of CitePretrainBench also provides a useful resource for studying attribution.

major comments (2)
  1. [Experiments] Experiments section: The central claim of consistent outperformance with gains up to 30.2% citation precision across tasks and models is reported without statistical significance tests, error bars, or details on run-to-run variance. This leaves the reliability of the Active Indexing advantage only moderately supported, especially given the reader's note on the absence of these elements in the abstract and results.
  2. [Active Indexing and CitePretrainBench] Active Indexing and CitePretrainBench sections: The method's effectiveness rests on the assumption that synthetic augmentations (diverse compositional restatements and bidirectional pairs) produce bindings that transfer to natural query distributions. The benchmark mixes novel documents but does not isolate or test performance on queries whose syntactic and compositional patterns avoid those deliberately injected during augmentation, which is load-bearing for the generalization claim underlying the 30.2% gain.
minor comments (2)
  1. [Abstract] Abstract: The maximum gain of 30.2% is stated without indicating the specific task, model size, or condition under which it is achieved, which would improve immediate readability of the key result.
  2. [Benchmark construction] The description of how novel documents are mixed into the benchmark and how test queries are sampled could be expanded for reproducibility, even if high-level details are present.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. We agree that certain aspects can be strengthened through revisions and have outlined specific changes to be incorporated in the revised version.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The central claim of consistent outperformance with gains up to 30.2% citation precision across tasks and models is reported without statistical significance tests, error bars, or details on run-to-run variance. This leaves the reliability of the Active Indexing advantage only moderately supported, especially given the reader's note on the absence of these elements in the abstract and results.

    Authors: We agree that reporting statistical significance tests, error bars, and run-to-run variance would strengthen the reliability of the empirical results. Our original experiments used single runs due to the high computational cost of continual pretraining for the Qwen-2.5-7B and 3B models. In the revised manuscript, we will conduct additional runs with varied random seeds for the main experiments, report means and standard deviations, and include statistical significance tests (such as paired t-tests) for the citation precision gains. We will also update the abstract and results sections to reflect these details. revision: yes

  2. Referee: [Active Indexing and CitePretrainBench] Active Indexing and CitePretrainBench sections: The method's effectiveness rests on the assumption that synthetic augmentations (diverse compositional restatements and bidirectional pairs) produce bindings that transfer to natural query distributions. The benchmark mixes novel documents but does not isolate or test performance on queries whose syntactic and compositional patterns avoid those deliberately injected during augmentation, which is load-bearing for the generalization claim underlying the 30.2% gain.

    Authors: We appreciate this observation on the generalization of the bindings to natural distributions. CitePretrainBench mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents, and the evaluation queries are drawn from this mixture to reflect realistic syntactic and compositional patterns. The performance improvements on both short-form and long-form tasks, along with the scaling trend up to 16x augmented data, provide supporting evidence for transfer. However, we acknowledge that the benchmark does not explicitly isolate queries with patterns fully disjoint from the augmentations. In the revised manuscript, we will add a dedicated discussion of this limitation and suggest it as future work, while moderating the strength of the generalization claims in the relevant sections. revision: partial
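
The multi-seed analysis promised in the first response could look roughly like the following. This is a minimal sketch using only the standard library. The per-seed precision values are invented for illustration, and `paired_t_statistic` is a hypothetical helper, not code from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(active: Sequence[float],
                       passive: Sequence[float]) -> Tuple[float, float, int]:
    """Return (mean difference, t statistic, degrees of freedom) for paired runs."""
    if len(active) != len(passive) or len(active) < 2:
        raise ValueError("need at least two paired runs")
    # Each pair shares a seed, so the test works on per-seed differences.
    diffs = [a - p for a, p in zip(active, passive)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, t_stat, len(diffs) - 1


# Hypothetical citation precision per random seed (three seeds).
active_runs = [0.52, 0.55, 0.50]
passive_runs = [0.30, 0.31, 0.29]
mean_diff, t_stat, df = paired_t_statistic(active_runs, passive_runs)
```

The statistic is then compared with the critical value of Student's t distribution for `df` degrees of freedom (4.303 for a two-sided test at the 0.05 level when `df` is 2). With SciPy available, `scipy.stats.ttest_rel(active_runs, passive_runs)` returns the same statistic together with a p-value.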

Circularity Check

0 steps flagged

No significant circularity in empirical training and evaluation chain

Full rationale

The paper defines Active Indexing explicitly as a training augmentation strategy (diverse compositional restatements plus bidirectional source-to-fact and fact-to-source pairs) during continual pretraining, then measures citation precision against an independent Passive Indexing baseline on CitePretrainBench. Results, ablations on data scaling, and comparisons to external retrieval are reported as experimental outcomes rather than quantities derived by construction from fitted parameters or prior self-citations. No equations or uniqueness theorems are invoked that reduce the central claim to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the untested assumption that synthetic data can create generalizable source bindings without introducing distribution shift that harms downstream citation accuracy.

free parameters (1)
  • scale of augmented synthetic data
    Ablations vary this from 1x to 16x original tokens and report continued improvement; the exact multiplier is chosen to demonstrate the trend.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 2 internal anchors

  1. [1]

    Measuring attribution in natural language generation models

    Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840, 2023. URL: https://aclanthology.org/2023.cl-4.2, doi:10.1162/coli_a_00486

  2. [2]

    Survey on factuality in large language models: Knowledge, retrieval and domain-specificity

    Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, et al. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity. ArXiv preprint, abs/2310.07521, 2023. URL: https://arxiv.org/abs/2310.07521

  3. [3]

    Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Hanchi Sun, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric P. Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Mano...

  4. [4]

    Extracting training data from large language models

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Kather- ine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX security symposium (USENIX Security 21), pages 2633–2650, 2021

  5. [5]

    Ayush Agrawal, Mirac Suzgun, Lester Mackey, and Adam Kalai. Do language models know when they’re hallucinating references? In Yvette Graham and Matthew Purver, editors,Findings of the Association for Computational Linguistics: EACL 2024 , pages 912–928, St. Julian’s, Malta, 2024. Association for Computational Linguistics. URL: https://aclanthology. org/20...

  6. [6]

    Chatgpt hallucinates when attributing answers

    Guido Zuccon, Bevan Koopman, and Razia Shaik. Chatgpt hallucinates when attributing answers. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 46–51, 2023

  7. [7]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schul- man. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint ...

  8. [8]

    Teaching language models to support answers with verified quotes

    Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chad- wick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. ArXiv preprint, abs/2203.11147, 2022. URL: https://arxiv.org/abs/2203.11147

  9. [9]

    Rethinking with retrieval: Faithful large language model inference

    Hangfeng He, Hongming Zhang, and Dan Roth. Rethinking with retrieval: Faithful large language model inference. ArXiv preprint, abs/2301.00303, 2023. URL: https://arxiv. org/abs/2301.00303

  10. [10]

    RARR: Researching and revising what language models say, using language models

    Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. RARR: Researching and revising what language models say, using language models. In Anna Rogers, Jordan Boyd- Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Associati...

  11. [11]

    URL: https://aclanthology.org/2023

    Association for Computational Linguistics. URL: https://aclanthology.org/2023. acl-long.910, doi:10.18653/v1/2023.acl-long.910

  12. [12]

    How context affects language models’ factual predictions

    Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. How context affects language models’ factual predictions. ArXiv preprint, abs/2005.04611, 2020. URL: https://arxiv.org/abs/2005.04611

  13. [13]

    Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts

    Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

  14. [14]

    URL: https://openreview.net/forum?id=auKAUJZMO6

    OpenReview.net, 2024. URL: https://openreview.net/forum?id=auKAUJZMO6

  15. [15]

    Automatic evaluation of attribution by large language models

    Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su, and Huan Sun. Automatic evaluation of attribution by large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4615–4635, Singapore, 2023. Association for Computational Linguistics. URL: https://aclanthology....

  16. [16]

    Enabling large language models to gen- erate text with citations

    Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to gen- erate text with citations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488, Singapore, 2023. Association for Computational Linguistics. URL: https://aclant...

  17. [17]

    Source-aware training enables knowledge attribution in language models

    Muhammad Khalifa, David Wadden, Emma Strubell, Honglak Lee, Lu Wang, Iz Beltagy, and Hao Peng. Source-aware training enables knowledge attribution in language models. In First Conference on Language Modeling , 2024. URL: https://openreview.net/forum?id= UPyWLwciYz

  18. [18]

    Evaluating verifiability in generative search engines

    Nelson Liu, Tianyi Zhang, and Percy Liang. Evaluating verifiability in generative search engines. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7001–7025, Singapore, 2023. Association for Computational Linguistics. URL: https://aclanthology.org/2023.findings-emnlp. 467, ...

  19. [19]

    Effective large language model adaptation for improved grounding and citation generation

    Xi Ye, Ruoxi Sun, Sercan Arik, and Tomas Pfister. Effective large language model adaptation for improved grounding and citation generation. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long P...

  20. [20]

    Hagrid: A human-llm collaborative dataset for generative information-seeking with attribution

    Ehsan Kamalloo, Aref Jafari, Xinyu Zhang, Nandan Thakur, and Jimmy Lin. Hagrid: A human-llm collaborative dataset for generative information-seeking with attribution. ArXiv preprint, abs/2307.16883, 2023. URL: https://arxiv.org/abs/2307.16883

  21. [21]

    Training language models to generate text with citations via fine-grained rewards

    Chengyu Huang, Zeqiu Wu, Yushi Hu, and Wenya Wang. Training language models to generate text with citations via fine-grained rewards. ArXiv preprint, abs/2402.04315, 2024. URL: https://arxiv.org/abs/2402.04315. 11

  22. [23]

    URL: https://arxiv.org/abs/2502.09604

  23. [24]

    To trust or not to trust? enhancing large language models’ situated faithfulness to external contexts

    Yukun Huang, Sanxing Chen, Hongyi Cai, and Bhuwan Dhingra. To trust or not to trust? enhancing large language models’ situated faithfulness to external contexts. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview. net/forum?id=K2jOacHUlO

  24. [25]

    Recitation-augmented language models

    Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL: https://openreview. net/pdf?id=-cqvvvb-NkI

  25. [26]

    according to

    Orion Weller, Marc Marone, Nathaniel Weir, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. “according to . . . ”: Prompting language models improves quoting from pre-training data. In Yvette Graham and Matthew Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long ...

  26. [27]

    Generative retrieval with large language models

    Ye Wang, Xinrun Xu, Rui Xie, Wenxin Hu, and Wei Ye. Generative retrieval with large language models. ArXiv preprint, abs/2402.17010, 2024. URL: https://arxiv.org/abs/ 2402.17010

  27. [28]

    Verifiable by design: Aligning language models to quote from pre-training data

    Jingyu Zhang, Marc Marone, Tianjian Li, Benjamin Van Durme, and Daniel Khashabi. Verifiable by design: Aligning language models to quote from pre-training data. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Techn...

  28. [29]

    ASQA: Factoid questions meet long-form answers

    Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. U...

  29. [30]

    KILT: a benchmark for knowledge intensive language tasks

    Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Belt...

  30. [31]

    ELI5: Long form question answering

    Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages 3558–3567, Florence, Italy, 2019. Association for Computational Linguistics. ...

  31. [32]

    CCNet: Extracting high quality monolingual datasets from web crawl data

    Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Jos...

  32. [33]

    Sciqag: A framework for auto-generated science question answering dataset with fine-grained evaluation

    Yuwei Wan, Yixuan Liu, Aswathy Ajith, Clara Grazian, Bram Hoex, Wenjie Zhang, Chunyu Kit, Tong Xie, and Ian Foster. Sciqag: A framework for auto-generated science question answering dataset with fine-grained evaluation. ArXiv preprint, abs/2405.09939, 2024. URL: https://arxiv.org/abs/2405.09939

  33. [34]

    Repliqa: A question-answering dataset for benchmarking llms on unseen reference content

    João Monteiro, Pierre-André Noël, Étienne Marcotte, Sai Rajeswar Mudumba, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Chris Pal, and Perouz Taslakian. Repliqa: A question-answering dataset for benchmarking llms on unseen reference content. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Ch...

  34. [35]

    FreshLLMs: Refreshing large language models with search engine augmentation

    Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. FreshLLMs: Refreshing large language models with search engine augmentation. In Lun-Wei Ku, Andre Mar- tins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguis- tics: ACL 2024 , pages 13697–...

  35. [36]

    TrueTeacher: Learning factual consistency evaluation with large language models

    Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, and Idan Szpektor. TrueTeacher: Learning factual consistency evaluation with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Meth- ods in Natural Language Processing , pages 2053–2070, Singapore, 2023. Association for Co...

  36. [37]

    Physics of language models: Part 3.2, knowledge manipula- tion

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipula- tion. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=oDbiL9CLoS

  37. [38]

    Synthetic continued pretraining

    Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candes, and Tatsunori Hashimoto. Synthetic continued pretraining. In The Thirteenth International Conference on Learning Representations,

  38. [39]

    URL: https://openreview.net/forum?id=07yvxWDSla

  39. [40]

    Generalization v.s

    Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, and William Yang Wang. Generalization v.s. memorization: Tracing language models’ capabilities back to pretraining data. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=IQxBDLmVpT

  40. [41]

    The web is your oyster - knowledge-intensive nlp against a very large web corpus

    Aleksandra Piktus, Fabio Petroni, Yizhong Wang, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Ouguz, Edouard Grave, Wen tau Yih, and Sebastian Riedel. The web is your oyster - knowledge-intensive nlp against a very large web corpus. ArXiv preprint, abs/2112.09924, 2021. URL: https://arxiv.org/abs/2112. 09924

  41. [42]

    Cohen, and Donald Met- zler

    Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Prakash Gupta, Tal Schuster, William W. Cohen, and Donald Met- zler. Transformer memory as a differentiable search index. In Sanmi Koyejo, S. Mo- hamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neu- ral Information Proces...

  42. [43]

    From matching to generation: A survey on generative information retrieval

    Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yuyao Zhang, Peitian Zhang, Yutao Zhu, and Zhicheng Dou. From matching to generation: A survey on generative information retrieval. ACM Transactions on Information Systems , 2024. URL: https://api.semanticscholar.org/CorpusID: 269303210

  43. [44]

    Corpuslm: Towards a unified language model on corpus for knowledge-intensive tasks

    Xiaoxi Li, Zhicheng Dou, Yujia Zhou, and Fangchao Liu. Corpuslm: Towards a unified language model on corpus for knowledge-intensive tasks. In Grace Hui Yang, Hongning Wang, Sam Han, Claudia Hauff, Guido Zuccon, and Yi Zhang, editors, Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024...

  44. [45]

    TRAK: attributing model behavior at scale

    Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. TRAK: attributing model behavior at scale. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA , volume 202 o...

  45. [46]

    Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, and Ian Tenney

    Tyler A. Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, and Ian Tenney. Scalable influence and fact tracing for large language model pretraining. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum? id=gLa96FlWwn

  46. [47]

    Zettlemoyer, and Pang Wei Koh

    Tong Chen, Akari Asai, Niloofar Mireshghallah, Sewon Min, James Grimmelmann, Yejin Choi, Hanna Hajishirzi, Luke S. Zettlemoyer, and Pang Wei Koh. Copybench: Measuring literal and non-literal reproduction of copyright-protected text in language model generation. ArXiv preprint, abs/2407.07087, 2024. URL: https://arxiv.org/abs/2407.07087

  47. [48]

    In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval

    Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Anna Rogers, Iacer Calixto, Ivan Vulić, Naomi Saphra, Nora Kassner, Oana-Maria Camburu, Trapit Bansal, and Vered Shwartz, editors, Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4N...

  48. [49]

    The model is directly prompted to rank text titles given document content

    Natural Titles (Raw): A baseline without continual pretraining. The model is directly prompted to rank text titles given document content. This tests whether pretrained LMs can match content to titles without exposure

  49. [50]

    This approach uses natural-language identifiers that align with the model’s training distribution

    Natural Titles: We perform continual pretraining where each document is appended with its human-written text title. This approach uses natural-language identifiers that align with the model’s training distribution

  50. [51]

    Documents are embedded and clustered using K-means into 10 top-level groups

    Hierarchical K-Means Integer (HKM-Integer): Instead of using random integers, we construct semantically structured integer IDs following [38]. Documents are embedded and clustered using K-means into 10 top-level groups. Each group is assigned a prefix digit. The process is recursively applied within each cluster, with each level adding a digit to the ID. ...

  51. [52]

    For each cluster, we use an LLM to generate a representative keyword based on its most salient documents

    Hierarchical LDA with Keyword Labels (HLDA-Keywords): We apply hierarchical topic modeling (LDA) to recursively cluster documents. For each cluster, we use an LLM to generate a representative keyword based on its most salient documents. The final identifier is a concatenation of these keywords along the cluster path, forming a semantic, hierarchical label....

  52. [53]

    Keywords appear first, followed by the broader domain label (e.g., entropy-energy-physics), emphasizing specificity before generality

    Keyword-First Domain Identifier (Keywords→Domain): Similar to the above, but constructed in a bottom-up manner. Keywords appear first, followed by the broader domain label (e.g., entropy-energy-physics), emphasizing specificity before generality.

    ID Type               Acc@1  Acc@10
    Natural Titles (Raw)    9.7    46.3
    Natural Titles         53.3    75.3
    HKM-Integer             2.0    21.7
    HLDA-Keywords...
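The HKM-Integer construction described above (cluster the embeddings into 10 groups, assign each group a prefix digit, then recurse within each cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `kmeans` and `hkm_integer_ids`, the plain Lloyd's k-means, the leaf rule (a final digit given by position once a cluster holds at most 10 documents), and the round-robin fallback for degenerate splits are all assumptions made for illustration, since the source description is truncated at that point.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns one label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def hkm_integer_ids(embeddings, branching=10, seed=0):
    """Map each document index to a digit string; each level adds one digit."""
    assert 2 <= branching <= 10, "one decimal digit per level"
    ids = {}

    def recurse(indices, prefix):
        if len(indices) <= branching:
            # Assumed leaf rule: the last digit is the position in the leaf.
            for pos, i in enumerate(indices):
                ids[i] = prefix + str(pos)
            return
        labels = kmeans([embeddings[i] for i in indices], branching, seed=seed)
        groups = [[i for i, lab in zip(indices, labels) if lab == c]
                  for c in range(branching)]
        groups = [g for g in groups if g]
        if len(groups) == 1:
            # Assumed fallback: identical points cannot be split by k-means,
            # so split round-robin to guarantee the recursion terminates.
            groups = [indices[j::branching] for j in range(branching)]
        for digit, group in enumerate(groups):
            recurse(group, prefix + str(digit))

    recurse(list(range(len(embeddings))), "")
    return ids


if __name__ == "__main__":
    rng = random.Random(42)
    toy_embeddings = [[rng.random(), rng.random()] for _ in range(50)]
    for doc, doc_id in sorted(hkm_integer_ids(toy_embeddings).items())[:5]:
        print(doc, doc_id)
```

In this sketch, documents that share a longer ID prefix were grouped together at more levels of the hierarchy, which is the sense in which the integers are semantically structured rather than random. A production pipeline would more likely use an established clustering library and real document embeddings.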

  53. [54]

    This indicates that the model is capable of recalling parametric knowledge and integrating diverse sources during generation

    Correct Answer with Faithful and Diverse Citations In ideal cases, the model not only produces a factually accurate and coherent answer, but also cites multiple distinct documents, each supporting a different part of the response. This indicates that the model is capable of recalling parametric knowledge and integrating diverse sources during generation. ...

  54. [55]

    This suggests a mismatch between content planning and citation generation

    Correct Answer but Incorrect Citations In some cases, the generated answer is factually correct and well-structured, but the cited documents are irrelevant. This suggests a mismatch between content planning and citation generation. Example: Question: Why do online communities crumble as they gain popularity? Model Answer: Communities may lose cohesion as ...

  55. [56]

    dry mouth

    Faithful Citations but Incomplete Answer Sometimes, the model successfully grounds all claims in real documents, but the final answer fails to directly address the question. Example: Question: Why do so many drugs cause “dry mouth” as a side effect? Model Answer: Many drugs cause xerostomia, or dry mouth. <|Understanding Medication Side Effects: The Pr...

  56. [57]

    Title Lure

    “Title Lure” Errors in Short-form QA In short-form QA tasks, the model sometimes selects citations solely based on title relevance, even when the document content lacks the required evidence. This reflects a superficial attribution mechanism. Example: Question: How is Boston addressing the digital divide in terms of communications technology from December 2...

  57. [58]

    Near Miss

    Cross-Domain Lookalikes and “Near Miss” Citations Occasionally, the model cites from a mismatched domain—e.g., a general Wikipedia article instead of a domain-specific source like RepliQA—producing citations that superficially resemble the ground truth but lack factual alignment. Example: Question: When was the last game of Copenhagen’s basketball season ...