Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models

Bhuwan Dhingra; Jian Pei; Manzil Zaheer; Sanxing Chen; Yukun Huang

arXiv: 2506.17585 · v3 · submitted 2025-06-21 · cs.AI · cs.CL · cs.LG

Yukun Huang, Sanxing Chen, Jian Pei, Manzil Zaheer, Bhuwan Dhingra

Pith reviewed 2026-05-19 07:36 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.LG

keywords retrieval-free citation, knowledge attribution, continual pretraining, active indexing, document identifiers, synthetic data augmentation, bidirectional training, CitePretrainBench

The pith

LLMs can learn reliable citations to their own pretraining documents without any external retrieval at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that revising continual pretraining to create persistent bindings between facts and document identifiers allows models to attribute their answers directly to sources seen during training. This matters because it removes the need for test-time retrieval, cutting latency, infrastructure costs, and exposure to retrieval errors while still producing verifiable outputs. The key technique augments training data with diverse restatements of each fact plus bidirectional source-to-fact and fact-to-source examples, producing more robust bindings than simply tagging documents with identifiers. Experiments on a new benchmark, CitePretrainBench, which mixes Wikipedia, Common Crawl, arXiv, and novel documents, report citation precision gains of up to 30.2% over a Passive Indexing baseline across single-fact and multi-fact tasks, and precision keeps improving as the volume of augmented data grows.

Core claim

Active Indexing during continual pretraining binds factual knowledge to persistent document identifiers by training on synthetic augmentations that restate each fact in diverse compositional forms and enforce bidirectional mappings between sources and facts. After subsequent instruction tuning, the resulting models generate content from cited sources and attribute their own answers with higher precision than a passive baseline that merely appends identifiers, with the advantage holding across short-form and long-form citation tasks and scaling as augmented data volume grows.

What carries the argument

Active Indexing, which augments pretraining data with compositional restatements and bidirectional source-to-fact training to create generalizable bindings between facts and document identifiers.
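
The contrast between the two indexing schemes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Document` class, the `<doc-0042>` identifier format, and the prompt templates are hypothetical, and the restated facts are hard-coded here, whereas the paper generates them as synthetic augmentations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str   # persistent identifier; this format is hypothetical
    text: str     # original document text
    facts: tuple  # restated facts; hard-coded stand-ins for synthetic augmentation


def passive_example(doc: Document) -> str:
    """Passive Indexing baseline: append the identifier to the document."""
    return f"{doc.text}\nSource: <{doc.doc_id}>"


def active_examples(doc: Document) -> list:
    """Build both training directions for every restated fact (illustrative templates)."""
    examples = []
    for fact in doc.facts:
        # Source-to-fact: the identifier comes first, the content follows.
        examples.append(f"According to <{doc.doc_id}>: {fact}")
        # Fact-to-source: the content comes first, the identifier follows.
        examples.append(f"{fact} Source: <{doc.doc_id}>")
    return examples


doc = Document(
    doc_id="doc-0042",
    text="The Example Bridge opened in 1990 and spans the Sample River.",
    facts=(
        "The Example Bridge opened in 1990.",
        "The Sample River is crossed by the Example Bridge.",
    ),
)
training_texts = [passive_example(doc)] + active_examples(doc)
for text in training_texts:
    print(text)
```

Each restated fact yields one example per direction, so the identifier shows up both before the content it governs and after the content it attributes.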

If this is right

Citation precision continues to rise as the amount of augmented synthetic data scales to at least 16 times the original token count.
Internal citations improve robustness when the model is later given noisy external retrieval results.
The same binding approach supports both single-fact short answers and multi-fact long-form generation.
The method works at both model sizes tested, the 3B and 7B Qwen-2.5 variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Removing the external retriever could simplify deployment of citation systems in resource-constrained environments.
Tying outputs to specific training documents may offer a route to audit or edit model knowledge by editing or removing source documents.
The bidirectional training pattern could be adapted to other attribution tasks such as tracing reasoning steps back to training examples.

Load-bearing premise

The synthetic data augmentations will create bindings that generalize to real user queries rather than only matching the synthetic distribution.

What would settle it

Citation precision on a held-out set of natural user queries falls below the synthetic benchmark results by more than the gap seen between active and passive indexing.
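
The test above is a comparison of two gaps, which a short Python sketch makes concrete. The function name and the precision values are hypothetical and serve only to illustrate the arithmetic; none of the numbers come from the paper.

```python
def drop_exceeds_method_gap(active_benchmark: float,
                            passive_benchmark: float,
                            active_natural: float) -> bool:
    """True if the benchmark-to-natural drop is larger than the active-passive gap."""
    method_gap = active_benchmark - passive_benchmark
    generalization_drop = active_benchmark - active_natural
    return generalization_drop > method_gap


# Hypothetical citation precision values on a 0-1 scale, for illustration only.
print(drop_exceeds_method_gap(0.60, 0.40, 0.35))  # True: natural queries fall too far
print(drop_exceeds_method_gap(0.60, 0.40, 0.50))  # False: the drop stays within the gap
```

Algebraically the condition reduces to `active_natural < passive_benchmark`: the premise would be in trouble if Active Indexing on natural queries scored below what Passive Indexing reaches on the benchmark.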

Figures

Figures reproduced from arXiv: 2506.17585 by Bhuwan Dhingra, Jian Pei, Manzil Zaheer, Sanxing Chen, Yukun Huang.

**Figure 2.** Scaling Curve of Combining Backward and Forward on RepliQA. Diverse Fact Representations Help Citation: Active Indexing generates diverse fact variants (through paraphrasing, composition, and interaction), all tied to the same document ID. This diversity helps the model generalize and reliably cite, improving both memorization and utilization. To evaluate how diversity impacts citation ability, we study how c…

**Figure 3.** Scaling Comparison Between Active Indexing and Passive Indexing on RepliQA.

Original abstract

Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B&3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs can be trained to reliably attribute citations to documents encountered during continual pretraining without test-time retrieval. It introduces CitePretrainBench (mixing real corpora such as Wikipedia, Common Crawl, and arXiv with novel documents) for short-form and long-form citation tasks, and proposes a two-stage method: Active Indexing via continual pretraining that augments data with diverse compositional restatements plus bidirectional (source-to-fact and fact-to-source) pairs to create source-anchored bindings, followed by instruction tuning. Experiments on Qwen-2.5-7B and 3B models show Active Indexing outperforming a Passive Indexing baseline (simple identifier appending) with citation precision gains up to 30.2%, and an upward performance trend as augmented data scales to 16x the original token count.

Significance. If the central empirical claims hold and generalize, the work offers a promising direction for retrieval-free citation in LLMs, which could reduce latency, infrastructure costs, and vulnerability to retrieval noise while complementing external retrieval. The scaling ablation showing continued gains with more augmented data is a clear strength that supports the method's viability. The introduction of CitePretrainBench also provides a useful resource for studying attribution.

major comments (2)

[Experiments] Experiments section: The central claim of consistent outperformance, with citation precision gains of up to 30.2% across tasks and models, is reported without statistical significance tests, error bars, or details on run-to-run variance. This leaves the reliability of the Active Indexing advantage only moderately supported, since neither the abstract nor the results report these elements.
[Active Indexing and CitePretrainBench] Active Indexing and CitePretrainBench sections: The method's effectiveness rests on the assumption that synthetic augmentations (diverse compositional restatements and bidirectional pairs) produce bindings that transfer to natural query distributions. The benchmark mixes novel documents but does not isolate or test performance on queries whose syntactic and compositional patterns avoid those deliberately injected during augmentation, which is load-bearing for the generalization claim underlying the 30.2% gain.

minor comments (2)

[Abstract] Abstract: The maximum gain of 30.2% is stated without indicating the specific task, model size, or condition under which it is achieved, which would improve immediate readability of the key result.
[Benchmark construction] The description of how novel documents are mixed into the benchmark and how test queries are sampled could be expanded for reproducibility, even if high-level details are present.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. We agree that certain aspects can be strengthened through revisions and have outlined specific changes to be incorporated in the revised version.

Point-by-point responses

Referee: [Experiments] Experiments section: The central claim of consistent outperformance, with citation precision gains of up to 30.2% across tasks and models, is reported without statistical significance tests, error bars, or details on run-to-run variance. This leaves the reliability of the Active Indexing advantage only moderately supported, since neither the abstract nor the results report these elements.

Authors: We agree that reporting statistical significance tests, error bars, and run-to-run variance would strengthen the reliability of the empirical results. Our original experiments used single runs due to the high computational cost of continual pretraining for the Qwen-2.5-7B and 3B models. In the revised manuscript, we will conduct additional runs with varied random seeds for the main experiments, report means and standard deviations, and include statistical significance tests (such as paired t-tests) for the citation precision gains. We will also update the abstract and results sections to reflect these details.

revision: yes

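
The paired comparison mentioned in this response could look like the sketch below, which uses only the Python standard library. The per-seed precision values are made-up placeholders, not results from the paper, and `paired_t_statistic` is a hypothetical helper. Turning the statistic into a p-value would additionally require the t distribution with the returned degrees of freedom, which a statistics package can supply.

```python
import math
import statistics


def paired_t_statistic(active, passive):
    """Return (mean difference, t statistic, degrees of freedom) for paired runs."""
    if len(active) != len(passive) or len(active) < 2:
        raise ValueError("need at least two paired runs of equal length")
    # Pair runs by seed and work with the per-seed differences.
    diffs = [a - p for a, p in zip(active, passive)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, t_stat, len(diffs) - 1


# Placeholder citation precision per random seed (illustrative values only).
active_runs = [0.62, 0.60, 0.64]
passive_runs = [0.40, 0.41, 0.39]
mean_diff, t_stat, dof = paired_t_statistic(active_runs, passive_runs)
print(f"mean gain={mean_diff:.3f}, t={t_stat:.2f}, dof={dof}")
```

Pairing by seed removes seed-level variation shared by both methods, which is why a paired test suits a comparison where Active and Passive Indexing are trained under matched conditions.
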
Referee: [Active Indexing and CitePretrainBench] Active Indexing and CitePretrainBench sections: The method's effectiveness rests on the assumption that synthetic augmentations (diverse compositional restatements and bidirectional pairs) produce bindings that transfer to natural query distributions. The benchmark mixes novel documents but does not isolate or test performance on queries whose syntactic and compositional patterns avoid those deliberately injected during augmentation, which is load-bearing for the generalization claim underlying the 30.2% gain.

Authors: We appreciate this observation on the generalization of the bindings to natural distributions. CitePretrainBench mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents, and the evaluation queries are drawn from this mixture to reflect realistic syntactic and compositional patterns. The performance improvements on both short-form and long-form tasks, along with the scaling trend up to 16x augmented data, provide supporting evidence for transfer. However, we acknowledge that the benchmark does not explicitly isolate queries with patterns fully disjoint from the augmentations. In the revised manuscript, we will add a dedicated discussion of this limitation and suggest it as future work, while moderating the strength of the generalization claims in the relevant sections.

revision: partial
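
One crude way to approximate the isolation the referee asks for is to keep only evaluation queries that share few word n-grams with the augmentation texts. The sketch below is a rough proxy under stated assumptions: surface n-gram overlap does not fully capture syntactic or compositional similarity, the whitespace tokenizer is deliberately simple, and the helper names, threshold, and example strings are hypothetical.

```python
def word_ngrams(text, n=3):
    """Set of lowercase word n-grams, using plain whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(query, augmented_texts, n=3):
    """Fraction of the query's n-grams that also occur in any augmentation text."""
    query_grams = word_ngrams(query, n)
    if not query_grams:
        return 0.0
    seen = set()
    for text in augmented_texts:
        seen |= word_ngrams(text, n)
    return len(query_grams & seen) / len(query_grams)


def disjoint_queries(queries, augmented_texts, threshold=0.0, n=3):
    """Keep queries whose overlap with the augmentations is at most `threshold`."""
    return [q for q in queries if overlap_ratio(q, augmented_texts, n) <= threshold]


# Illustrative strings only; they are not drawn from CitePretrainBench.
augmented = ["the example bridge opened in 1990"]
queries = [
    "when was the example bridge opened",
    "which river does it cross",
]
kept = disjoint_queries(queries, augmented)
print(kept)
```

Reporting citation precision separately on the retained subset would show how much of the gain survives once surface overlap with the augmentations is removed.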

Circularity Check

0 steps flagged

No significant circularity in empirical training and evaluation chain

The paper defines Active Indexing explicitly as a training augmentation strategy (diverse compositional restatements plus bidirectional source-to-fact and fact-to-source pairs) during continual pretraining, then measures citation precision against an independent Passive Indexing baseline on CitePretrainBench. Results, ablations on data scaling, and comparisons to external retrieval are reported as experimental outcomes rather than quantities derived by construction from fitted parameters or prior self-citations. No equations or uniqueness theorems are invoked that reduce the central claim to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the untested assumption that synthetic data can create generalizable source bindings without introducing distribution shift that harms downstream citation accuracy.

free parameters (1)

scale of augmented synthetic data
Ablations vary this from 1x to 16x original tokens and report continued improvement; the exact multiplier is chosen to demonstrate the trend.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Active Indexing... creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 2 internal anchors

[1]

Measuring attribution in natural language generation models

Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840, 2023. URL: https://aclanthology.org/2023.cl-4.2, doi:10.1162/coli_a_00486

work page doi:10.1162/coli_a_00486 2023
[2]

Survey on factuality in large language models: Knowledge, retrieval and domain-specificity

Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, et al. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity. ArXiv preprint, abs/2310.07521, 2023. URL: https://arxiv.org/abs/2310.07521

work page arXiv 2023
[3]

Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Hanchi Sun, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric P. Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Mano...

work page 2024
[4]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Kather- ine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX security symposium (USENIX Security 21), pages 2633–2650, 2021

work page 2021
[5]

Ayush Agrawal, Mirac Suzgun, Lester Mackey, and Adam Kalai. Do language models know when they’re hallucinating references? In Yvette Graham and Matthew Purver, editors,Findings of the Association for Computational Linguistics: EACL 2024 , pages 912–928, St. Julian’s, Malta, 2024. Association for Computational Linguistics. URL: https://aclanthology. org/20...

work page 2024
[6]

Chatgpt hallucinates when attributing answers

Guido Zuccon, Bevan Koopman, and Razia Shaik. Chatgpt hallucinates when attributing answers. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 46–51, 2023

work page 2023
[7]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schul- man. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint ...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Teaching language models to support answers with verified quotes

Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chad- wick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. ArXiv preprint, abs/2203.11147, 2022. URL: https://arxiv.org/abs/2203.11147

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

Rethinking with retrieval: Faithful large language model inference

Hangfeng He, Hongming Zhang, and Dan Roth. Rethinking with retrieval: Faithful large language model inference. ArXiv preprint, abs/2301.00303, 2023. URL: https://arxiv. org/abs/2301.00303

work page arXiv 2023
[10]

RARR: Researching and revising what language models say, using language models

Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. RARR: Researching and revising what language models say, using language models. In Anna Rogers, Jordan Boyd- Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Associati...

work page
[11]

URL: https://aclanthology.org/2023

Association for Computational Linguistics. URL: https://aclanthology.org/2023. acl-long.910, doi:10.18653/v1/2023.acl-long.910

work page doi:10.18653/v1/2023.acl-long.910 2023
[12]

How context affects language models’ factual predictions

Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. How context affects language models’ factual predictions. ArXiv preprint, abs/2005.04611, 2020. URL: https://arxiv.org/abs/2005.04611

work page arXiv 2005
[13]

Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts

Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

work page 2024
[14]

URL: https://openreview.net/forum?id=auKAUJZMO6

OpenReview.net, 2024. URL: https://openreview.net/forum?id=auKAUJZMO6

work page 2024
[15]

Automatic evaluation of attribution by large language models

Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su, and Huan Sun. Automatic evaluation of attribution by large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4615–4635, Singapore, 2023. Association for Computational Linguistics. URL: https://aclanthology....

work page doi:10.18653/v1/2023.findings-emnlp.307 2023
[16]

Enabling large language models to gen- erate text with citations

Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to gen- erate text with citations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488, Singapore, 2023. Association for Computational Linguistics. URL: https://aclant...

work page doi:10.18653/v1/2023.emnlp-main.398 2023
[17]

Source-aware training enables knowledge attribution in language models

Muhammad Khalifa, David Wadden, Emma Strubell, Honglak Lee, Lu Wang, Iz Beltagy, and Hao Peng. Source-aware training enables knowledge attribution in language models. In First Conference on Language Modeling , 2024. URL: https://openreview.net/forum?id= UPyWLwciYz

work page 2024
[18]

Evaluating verifiability in generative search engines

Nelson Liu, Tianyi Zhang, and Percy Liang. Evaluating verifiability in generative search engines. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7001–7025, Singapore, 2023. Association for Computational Linguistics. URL: https://aclanthology.org/2023.findings-emnlp. 467, ...

work page doi:10.18653/v1/2023.findings-emnlp.467 2023
[19]

Effective large language model adaptation for improved grounding and citation generation

Xi Ye, Ruoxi Sun, Sercan Arik, and Tomas Pfister. Effective large language model adaptation for improved grounding and citation generation. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long P...

work page 2024
[20]

Hagrid: A human-llm collaborative dataset for generative information-seeking with attribution

Ehsan Kamalloo, Aref Jafari, Xinyu Zhang, Nandan Thakur, and Jimmy Lin. Hagrid: A human-llm collaborative dataset for generative information-seeking with attribution. ArXiv preprint, abs/2307.16883, 2023. URL: https://arxiv.org/abs/2307.16883

work page arXiv 2023
[21]

Training language models to generate text with citations via fine-grained rewards

Chengyu Huang, Zeqiu Wu, Yushi Hu, and Wenya Wang. Training language models to generate text with citations via fine-grained rewards. ArXiv preprint, abs/2402.04315, 2024. URL: https://arxiv.org/abs/2402.04315. 11

work page arXiv 2024
[23]

URL: https://arxiv.org/abs/2502.09604

work page arXiv
[24]

To trust or not to trust? enhancing large language models’ situated faithfulness to external contexts

Yukun Huang, Sanxing Chen, Hongyi Cai, and Bhuwan Dhingra. To trust or not to trust? enhancing large language models’ situated faithfulness to external contexts. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview. net/forum?id=K2jOacHUlO

work page 2025
[25]

Recitation-augmented language models

Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL: https://openreview. net/pdf?id=-cqvvvb-NkI

work page 2023
[26]

according to

Orion Weller, Marc Marone, Nathaniel Weir, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. “according to . . . ”: Prompting language models improves quoting from pre-training data. In Yvette Graham and Matthew Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long ...

work page 2024
[27]

Generative retrieval with large language models

Ye Wang, Xinrun Xu, Rui Xie, Wenxin Hu, and Wei Ye. Generative retrieval with large language models. ArXiv preprint, abs/2402.17010, 2024. URL: https://arxiv.org/abs/ 2402.17010

work page arXiv 2024
[28]

Verifiable by design: Aligning language models to quote from pre-training data

Jingyu Zhang, Marc Marone, Tianjian Li, Benjamin Van Durme, and Daniel Khashabi. Verifiable by design: Aligning language models to quote from pre-training data. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Techn...

work page 2025
[29]

ASQA: Factoid questions meet long-form answers

Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. U...

work page 2022
[30]

KILT: a benchmark for knowledge intensive language tasks

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Belt...

work page doi:10.18653/v1/2021.naacl-main.200 2021
[31]

ELI5: Long form question answering

Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages 3558–3567, Florence, Italy, 2019. Association for Computational Linguistics. ...

work page doi:10.18653/v1/p19-1346 2019
[32]

CCNet: Extracting high quality monolingual datasets from web crawl data

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Jos...

work page 2020
[33]

Sciqag: A framework for auto-generated science question answering dataset with fine-grained evaluation

Yuwei Wan, Yixuan Liu, Aswathy Ajith, Clara Grazian, Bram Hoex, Wenjie Zhang, Chunyu Kit, Tong Xie, and Ian Foster. Sciqag: A framework for auto-generated science question answering dataset with fine-grained evaluation. ArXiv preprint, abs/2405.09939, 2024. URL: https://arxiv.org/abs/2405.09939

work page arXiv 2024
[34]

Repliqa: A question-answering dataset for benchmarking llms on unseen reference content

João Monteiro, Pierre-André Noël, Étienne Marcotte, Sai Rajeswar Mudumba, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Chris Pal, and Perouz Taslakian. Repliqa: A question-answering dataset for benchmarking llms on unseen reference content. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Ch...

work page 2024
[35]

FreshLLMs: Refreshing large language models with search engine augmentation

Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. FreshLLMs: Refreshing large language models with search engine augmentation. In Lun-Wei Ku, Andre Mar- tins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguis- tics: ACL 2024 , pages 13697–...

work page doi:10.18653/v1/2024.findings-acl.813 2024
[36]

TrueTeacher: Learning factual consistency evaluation with large language models

Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, and Idan Szpektor. TrueTeacher: Learning factual consistency evaluation with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Meth- ods in Natural Language Processing , pages 2053–2070, Singapore, 2023. Association for Co...

work page doi:10.18653/v1/2023.emnlp-main.127 2023
[37]

Physics of language models: Part 3.2, knowledge manipula- tion

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipula- tion. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=oDbiL9CLoS

work page 2025
[38]

Synthetic continued pretraining

Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candes, and Tatsunori Hashimoto. Synthetic continued pretraining. In The Thirteenth International Conference on Learning Representations,

work page
[39]

URL: https://openreview.net/forum?id=07yvxWDSla

work page
[40]

Generalization v.s

Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, and William Yang Wang. Generalization v.s. memorization: Tracing language models’ capabilities back to pretraining data. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=IQxBDLmVpT

work page 2025
[41]

The web is your oyster - knowledge-intensive nlp against a very large web corpus

Aleksandra Piktus, Fabio Petroni, Yizhong Wang, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Ouguz, Edouard Grave, Wen tau Yih, and Sebastian Riedel. The web is your oyster - knowledge-intensive nlp against a very large web corpus. ArXiv preprint, abs/2112.09924, 2021. URL: https://arxiv.org/abs/2112. 09924

work page arXiv 2021
[42]

Cohen, and Donald Met- zler

Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Prakash Gupta, Tal Schuster, William W. Cohen, and Donald Met- zler. Transformer memory as a differentiable search index. In Sanmi Koyejo, S. Mo- hamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neu- ral Information Proces...

work page 2022
[43]

From matching to generation: A survey on generative information retrieval

Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yuyao Zhang, Peitian Zhang, Yutao Zhu, and Zhicheng Dou. From matching to generation: A survey on generative information retrieval. ACM Transactions on Information Systems , 2024. URL: https://api.semanticscholar.org/CorpusID: 269303210

work page 2024
[44]

Corpuslm: Towards a unified language model on corpus for knowledge-intensive tasks

Xiaoxi Li, Zhicheng Dou, Yujia Zhou, and Fangchao Liu. Corpuslm: Towards a unified language model on corpus for knowledge-intensive tasks. In Grace Hui Yang, Hongning Wang, Sam Han, Claudia Hauff, Guido Zuccon, and Yi Zhang, editors, Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024...

work page doi:10.1145/3626772.3657778 2024
[45]

TRAK: attributing model behavior at scale

Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. TRAK: attributing model behavior at scale. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA , volume 202 o...

work page 2023
[46]

Scalable influence and fact tracing for large language model pretraining

Tyler A. Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, and Ian Tenney. Scalable influence and fact tracing for large language model pretraining. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=gLa96FlWwn

work page 2025
[47]

Copybench: Measuring literal and non-literal reproduction of copyright-protected text in language model generation

Tong Chen, Akari Asai, Niloofar Mireshghallah, Sewon Min, James Grimmelmann, Yejin Choi, Hanna Hajishirzi, Luke S. Zettlemoyer, and Pang Wei Koh. Copybench: Measuring literal and non-literal reproduction of copyright-protected text in language model generation. ArXiv preprint, abs/2407.07087, 2024. URL: https://arxiv.org/abs/2407.07087

work page arXiv 2024
[48]

In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval

Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Anna Rogers, Iacer Calixto, Ivan Vulić, Naomi Saphra, Nora Kassner, Oana-Maria Camburu, Trapit Bansal, and Vered Shwartz, editors, Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4N...

work page doi:10.18653/v1/2021.repl4nlp-1.17 2021
[49]

The model is directly prompted to rank text titles given document content

Natural Title (Raw): A baseline without continual pretraining. The model is directly prompted to rank text titles given document content. This tests whether pretrained LMs can match content to titles without exposure

work page
[50]

This approach uses natural-language identifiers that align with the model’s training distribution

Natural Titles: We perform continual pretraining where each document is appended with its human-written text title. This approach uses natural-language identifiers that align with the model’s training distribution

work page
[51]

Documents are embedded and clustered using K-means into 10 top-level groups

Hierarchical K-Means Integer (HKM-Integer): Instead of using random integers, we construct semantically structured integer IDs following [38]. Documents are embedded and clustered using K-means into 10 top-level groups. Each group is assigned a prefix digit. The process is recursively applied within each cluster, with each level adding a digit to the ID. ...

work page
[52]

For each cluster, we use an LLM to generate a representative keyword based on its most salient documents

Hierarchical LDA with Keyword Labels (HLDA-Keywords) We apply hierarchical topic modeling (LDA) to recursively cluster documents. For each cluster, we use an LLM to generate a representative keyword based on its most salient documents. The final identifier is a concatenation of these keywords along the cluster path, forming a semantic, hierarchical label....

work page
[53]

Keywords appear first, followed by the broader domain label (e.g., entropy-energy-physics), emphasizing specificity before generality

Keyword-First Domain Identifier (Keywords→Domain): Similar to the above, but constructed in a bottom-up manner. Keywords appear first, followed by the broader domain label (e.g., entropy-energy-physics), emphasizing specificity before generality.

ID Type | Acc@1 | Acc@10
Natural Titles (Raw) | 9.7 | 46.3
Natural Titles | 53.3 | 75.3
HKM-Integer | 2.0 | 21.7
HLDA-Keywords...

work page
[54]

This indicates that the model is capable of recalling parametric knowledge and integrating diverse sources during generation

Correct Answer with Faithful and Diverse Citations In ideal cases, the model not only produces a factually accurate and coherent answer, but also cites multiple distinct documents, each supporting a different part of the response. This indicates that the model is capable of recalling parametric knowledge and integrating diverse sources during generation. ...

work page
[55]

This suggests a mismatch between content planning and citation generation

Correct Answer but Incorrect Citations In some cases, the generated answer is factually correct and well-structured, but the cited documents are irrelevant. This suggests a mismatch between content planning and citation generation. Example: Question: Why do online communities crumble as they gain popularity? Model Answer: Communities may lose cohesion as ...

work page
[56]

dry mouth

Faithful Citations but Incomplete Answer Sometimes, the model successfully grounds all claims in real documents, but the final answer fails to directly address the question. Example: Question: Why do so many drugs cause “dry mouth” as a side effect? Model Answer: Many drugs cause xerostomia, or dry mouth. <|Understanding Medication Side Effects: The Pr...

work page
[57]

Title Lure

“Title Lure” Errors in Short-form QA In short-form QA tasks, the model sometimes selects citations solely based on title relevance, even when the document content lacks the required evidence. This reflects a superficial attribution mechanism. Example: Question: How is Boston addressing the digital divide in terms of communications technology from December 2...

work page
[58]

Near Miss

Cross-Domain Lookalikes and “Near Miss” Citations Occasionally, the model cites from a mismatched domain—e.g., a general Wikipedia article instead of a domain-specific source like RepliQA—producing citations that superficially resemble the ground truth but lack factual alignment. Example: Question: When was the last game of Copenhagen’s basketball season ...

work page

[1] [1]

Measuring attribution in natural language generation models

Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840, 2023. URL: https://aclanthology.org/2023.cl-4.2, doi:10.1162/coli_a_00486

work page doi:10.1162/coli_a_00486 2023

[2] [2]

Survey on factuality in large language models: Knowledge, retrieval and domain-specificity

Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, et al. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity. ArXiv preprint, abs/2310.07521, 2023. URL: https://arxiv.org/abs/2310.07521

work page arXiv 2023

[3] [3]

Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Hanchi Sun, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric P. Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Mano...

work page 2024

[4] [4]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX security symposium (USENIX Security 21), pages 2633–2650, 2021

work page 2021

[5] [5]

Ayush Agrawal, Mirac Suzgun, Lester Mackey, and Adam Kalai. Do language models know when they’re hallucinating references? In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics: EACL 2024, pages 912–928, St. Julian’s, Malta, 2024. Association for Computational Linguistics. URL: https://aclanthology.org/20...

work page 2024

[6] [6]

Chatgpt hallucinates when attributing answers

Guido Zuccon, Bevan Koopman, and Razia Shaik. Chatgpt hallucinates when attributing answers. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 46–51, 2023

work page 2023

[7] [7]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint ...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[8] [8]

Teaching language models to support answers with verified quotes

Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. ArXiv preprint, abs/2203.11147, 2022. URL: https://arxiv.org/abs/2203.11147

work page internal anchor Pith review Pith/arXiv arXiv 2022

[9] [9]

Rethinking with retrieval: Faithful large language model inference

Hangfeng He, Hongming Zhang, and Dan Roth. Rethinking with retrieval: Faithful large language model inference. ArXiv preprint, abs/2301.00303, 2023. URL: https://arxiv.org/abs/2301.00303

work page arXiv 2023

[10] [10]

RARR: Researching and revising what language models say, using language models

Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. RARR: Researching and revising what language models say, using language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Associati...

work page

[11] [11]

Association for Computational Linguistics. URL: https://aclanthology.org/2023.acl-long.910, doi:10.18653/v1/2023.acl-long.910

work page doi:10.18653/v1/2023.acl-long.910 2023

[12] [12]

How context affects language models’ factual predictions

Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. How context affects language models’ factual predictions. ArXiv preprint, abs/2005.04611, 2020. URL: https://arxiv.org/abs/2005.04611

work page arXiv 2020

[13] [13]

Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts

Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

work page 2024

[14] [14]

OpenReview.net, 2024. URL: https://openreview.net/forum?id=auKAUJZMO6

work page 2024

[15] [15]

Automatic evaluation of attribution by large language models

Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su, and Huan Sun. Automatic evaluation of attribution by large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4615–4635, Singapore, 2023. Association for Computational Linguistics. URL: https://aclanthology....

work page doi:10.18653/v1/2023.findings-emnlp.307 2023

[16] [16]

Enabling large language models to generate text with citations

Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488, Singapore, 2023. Association for Computational Linguistics. URL: https://aclant...

work page doi:10.18653/v1/2023.emnlp-main.398 2023

[17] [17]

Source-aware training enables knowledge attribution in language models

Muhammad Khalifa, David Wadden, Emma Strubell, Honglak Lee, Lu Wang, Iz Beltagy, and Hao Peng. Source-aware training enables knowledge attribution in language models. In First Conference on Language Modeling, 2024. URL: https://openreview.net/forum?id=UPyWLwciYz

work page 2024

[18] [18]

Evaluating verifiability in generative search engines

Nelson Liu, Tianyi Zhang, and Percy Liang. Evaluating verifiability in generative search engines. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7001–7025, Singapore, 2023. Association for Computational Linguistics. URL: https://aclanthology.org/2023.findings-emnlp.467, ...

work page doi:10.18653/v1/2023.findings-emnlp.467 2023

[19] [19]

Effective large language model adaptation for improved grounding and citation generation

Xi Ye, Ruoxi Sun, Sercan Arik, and Tomas Pfister. Effective large language model adaptation for improved grounding and citation generation. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long P...

work page 2024

[20] [20]

Hagrid: A human-llm collaborative dataset for generative information-seeking with attribution

Ehsan Kamalloo, Aref Jafari, Xinyu Zhang, Nandan Thakur, and Jimmy Lin. Hagrid: A human-llm collaborative dataset for generative information-seeking with attribution. ArXiv preprint, abs/2307.16883, 2023. URL: https://arxiv.org/abs/2307.16883

work page arXiv 2023

[21] [21]

Training language models to generate text with citations via fine-grained rewards

Chengyu Huang, Zeqiu Wu, Yushi Hu, and Wenya Wang. Training language models to generate text with citations via fine-grained rewards. ArXiv preprint, abs/2402.04315, 2024. URL: https://arxiv.org/abs/2402.04315

work page arXiv 2024

[22] [23]

URL: https://arxiv.org/abs/2502.09604

work page arXiv

[23] [24]

To trust or not to trust? enhancing large language models’ situated faithfulness to external contexts

Yukun Huang, Sanxing Chen, Hongyi Cai, and Bhuwan Dhingra. To trust or not to trust? enhancing large language models’ situated faithfulness to external contexts. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=K2jOacHUlO

work page 2025

[24] [25]

Recitation-augmented language models

Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL: https://openreview.net/pdf?id=-cqvvvb-NkI

work page 2023

[25] [26]

“according to . . . ”: Prompting language models improves quoting from pre-training data

Orion Weller, Marc Marone, Nathaniel Weir, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. “according to . . . ”: Prompting language models improves quoting from pre-training data. In Yvette Graham and Matthew Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long ...

work page 2024

[26] [27]

Generative retrieval with large language models

Ye Wang, Xinrun Xu, Rui Xie, Wenxin Hu, and Wei Ye. Generative retrieval with large language models. ArXiv preprint, abs/2402.17010, 2024. URL: https://arxiv.org/abs/2402.17010

work page arXiv 2024

[27] [28]

Verifiable by design: Aligning language models to quote from pre-training data

Jingyu Zhang, Marc Marone, Tianjian Li, Benjamin Van Durme, and Daniel Khashabi. Verifiable by design: Aligning language models to quote from pre-training data. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Techn...

work page 2025

[28] [29]

ASQA: Factoid questions meet long-form answers

Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. U...

work page 2022

[29] [30]

KILT: a benchmark for knowledge intensive language tasks

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Belt...

work page doi:10.18653/v1/2021.naacl-main.200 2021

[30] [31]

ELI5: Long form question answering

Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy, 2019. Association for Computational Linguistics. ...

work page doi:10.18653/v1/p19-1346 2019

[31] [32]

CCNet: Extracting high quality monolingual datasets from web crawl data

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Jos...

work page 2020

[32] [33]

Sciqag: A framework for auto-generated science question answering dataset with fine-grained evaluation

Yuwei Wan, Yixuan Liu, Aswathy Ajith, Clara Grazian, Bram Hoex, Wenjie Zhang, Chunyu Kit, Tong Xie, and Ian Foster. Sciqag: A framework for auto-generated science question answering dataset with fine-grained evaluation. ArXiv preprint, abs/2405.09939, 2024. URL: https://arxiv.org/abs/2405.09939

work page arXiv 2024

[33] [34]

Repliqa: A question-answering dataset for benchmarking llms on unseen reference content

João Monteiro, Pierre-André Noël, Étienne Marcotte, Sai Rajeswar Mudumba, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Chris Pal, and Perouz Taslakian. Repliqa: A question-answering dataset for benchmarking llms on unseen reference content. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Ch...

work page 2024

[34] [35]

FreshLLMs: Refreshing large language models with search engine augmentation

Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. FreshLLMs: Refreshing large language models with search engine augmentation. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 13697–...

work page doi:10.18653/v1/2024.findings-acl.813 2024

[35] [36]

TrueTeacher: Learning factual consistency evaluation with large language models

Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, and Idan Szpektor. TrueTeacher: Learning factual consistency evaluation with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2053–2070, Singapore, 2023. Association for Co...

work page doi:10.18653/v1/2023.emnlp-main.127 2023

[36] [37]

Physics of language models: Part 3.2, knowledge manipulation

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipulation. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=oDbiL9CLoS

work page 2025

[37] [38]

Synthetic continued pretraining

Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candes, and Tatsunori Hashimoto. Synthetic continued pretraining. In The Thirteenth International Conference on Learning Representations,

work page

[38] [39]

URL: https://openreview.net/forum?id=07yvxWDSla

work page

[39] [40]

Generalization v.s. memorization: Tracing language models’ capabilities back to pretraining data

Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, and William Yang Wang. Generalization v.s. memorization: Tracing language models’ capabilities back to pretraining data. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=IQxBDLmVpT

work page 2025
