Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
Pith reviewed 2026-05-19 07:36 UTC · model grok-4.3
The pith
LLMs can learn reliable citations to their own pretraining documents without any external retrieval at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Active Indexing during continual pretraining binds factual knowledge to persistent document identifiers by training on synthetic augmentations that restate each fact in diverse compositional forms and enforce bidirectional mappings between sources and facts. After subsequent instruction tuning, the resulting models generate content from cited sources and attribute their own answers with higher precision than a passive baseline that merely appends identifiers, with the advantage holding across short-form and long-form citation tasks and scaling as augmented data volume grows.
What carries the argument
Active Indexing, which augments pretraining data with compositional restatements and bidirectional source-to-fact training to create generalizable bindings between facts and document identifiers.
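To make the bidirectional binding concrete, the following is a minimal sketch of how one fact could be turned into source-to-fact and fact-to-source training pairs. The `Fact` class, the `build_bidirectional_examples` helper, the prompt templates, and the angle-bracket identifier format are all illustrative assumptions, not the paper's actual data format. The restatements here are hand-written, whereas the paper describes them as synthetic augmentations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    doc_id: str          # persistent document identifier (hypothetical format)
    restatements: tuple  # paraphrases of one fact; hand-written for this sketch


def build_bidirectional_examples(fact):
    """Return (prompt, target) pairs in both directions for one fact."""
    examples = []
    for text in fact.restatements:
        # Source-to-fact: given the identifier, produce the content.
        examples.append((f"According to <{fact.doc_id}>:", text))
        # Fact-to-source: given the content, produce the identifier.
        examples.append((f"{text} Source:", f"<{fact.doc_id}>"))
    return examples


fact = Fact(
    doc_id="doc-0042",
    restatements=(
        "The bridge opened in 1932.",
        "In 1932, the bridge was opened to traffic.",
    ),
)
pairs = build_bidirectional_examples(fact)
for prompt, target in pairs:
    print(prompt, "->", target)
```

Each restatement yields two pairs, so the number of training examples grows linearly with the number of paraphrases per fact. That is one plausible reading of why the volume of augmented data can be scaled independently of the original corpus size.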
If this is right
- Citation precision continues to rise as the amount of augmented synthetic data scales to at least 16 times the original token count.
- Internal citations improve robustness when the model is later given noisy external retrieval results.
- The same binding approach supports both single-fact short answers and multi-fact long-form generation.
- The method works across model sizes tested, including 3B and 7B Qwen-2.5 variants.
Where Pith is reading between the lines
- Removing the external retriever could simplify deployment of citation systems in resource-constrained environments.
- Tying outputs to specific training documents may offer a route to audit or edit model knowledge by editing or removing source documents.
- The bidirectional training pattern could be adapted to other attribution tasks such as tracing reasoning steps back to training examples.
Load-bearing premise
The synthetic data augmentations will create bindings that generalize to real user queries rather than only matching the synthetic distribution.
What would settle it
Citation precision on a held-out set of natural user queries falls below the synthetic benchmark results by more than the gap seen between active and passive indexing.
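The criterion above can be written as a single comparison. This sketch assumes that "the synthetic benchmark results" means the Active Indexing model's own benchmark precision; the function name and all numbers are hypothetical and are not taken from the paper.

```python
def generalization_premise_fails(active_benchmark, passive_benchmark, active_natural):
    """True if the drop on natural queries exceeds the active-vs-passive gap.

    All arguments are citation precision values on the same scale (e.g. 0..1).
    """
    drop = active_benchmark - active_natural      # loss when moving to natural queries
    gap = active_benchmark - passive_benchmark    # advantage of Active over Passive
    # Under the stated assumption, drop > gap is algebraically the same as
    # active_natural < passive_benchmark.
    return drop > gap


# Hypothetical numbers for illustration only.
print(generalization_premise_fails(0.60, 0.30, 0.25))
print(generalization_premise_fails(0.60, 0.30, 0.50))
```

Read this way, the test asks whether the actively indexed model, on natural queries, does worse than the passive baseline did on the benchmark.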
Original abstract
Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B&3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs can be trained to reliably attribute citations to documents encountered during continual pretraining without test-time retrieval. It introduces CitePretrainBench (mixing real corpora such as Wikipedia, Common Crawl, and arXiv with novel documents) for short-form and long-form citation tasks, and proposes a two-stage method: Active Indexing via continual pretraining that augments data with diverse compositional restatements plus bidirectional (source-to-fact and fact-to-source) pairs to create source-anchored bindings, followed by instruction tuning. Experiments on Qwen-2.5-7B and 3B models show Active Indexing outperforming a Passive Indexing baseline (simple identifier appending) with citation precision gains up to 30.2%, and an upward performance trend as augmented data scales to 16x the original token count.
Significance. If the central empirical claims hold and generalize, the work offers a promising direction for retrieval-free citation in LLMs, which could reduce latency, infrastructure costs, and vulnerability to retrieval noise while complementing external retrieval. The scaling ablation showing continued gains with more augmented data is a clear strength that supports the method's viability. The introduction of CitePretrainBench also provides a useful resource for studying attribution.
major comments (2)
- [Experiments] Experiments section: The central claim of consistent outperformance with gains up to 30.2% citation precision across tasks and models is reported without statistical significance tests, error bars, or details on run-to-run variance. This leaves the reliability of the Active Indexing advantage only moderately supported, especially given the reader's note on the absence of these elements in the abstract and results.
- [Active Indexing and CitePretrainBench] Active Indexing and CitePretrainBench sections: The method's effectiveness rests on the assumption that synthetic augmentations (diverse compositional restatements and bidirectional pairs) produce bindings that transfer to natural query distributions. The benchmark mixes novel documents but does not isolate or test performance on queries whose syntactic and compositional patterns avoid those deliberately injected during augmentation, which is load-bearing for the generalization claim underlying the 30.2% gain.
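One way to approximate the held-out split the second major comment asks for is to keep only test queries that share no word n-gram with any augmentation text. This is a rough sketch under a strong assumption: lexical n-gram overlap is a crude stand-in for "syntactic and compositional patterns", and the function names and example strings are hypothetical.

```python
def word_ngrams(text, n=3):
    """Set of lowercase word n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def disjoint_queries(test_queries, augmentation_texts, n=3):
    """Keep test queries that share no word n-gram with any augmentation text."""
    seen = set()
    for text in augmentation_texts:
        seen |= word_ngrams(text, n)
    return [q for q in test_queries if not (word_ngrams(q, n) & seen)]


# Hypothetical augmentation texts and test queries.
augmentations = ["what year did the bridge open", "the bridge opened in 1932"]
queries = [
    "what year did the tunnel open",
    "when was that river crossing first usable",
]
held_out = disjoint_queries(queries, augmentations)
print(held_out)
```

Reporting citation precision separately on the filtered subset and on the full test set would show how much of the gain survives when surface patterns from the augmentation are excluded.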
minor comments (2)
- [Abstract] Abstract: The maximum gain of 30.2% is stated without indicating the specific task, model size, or condition under which it is achieved, which would improve immediate readability of the key result.
- [Benchmark construction] The description of how novel documents are mixed into the benchmark and how test queries are sampled could be expanded for reproducibility, even if high-level details are present.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. We agree that certain aspects can be strengthened through revisions and have outlined specific changes to be incorporated in the revised version.
-
Referee: [Experiments] Experiments section: The central claim of consistent outperformance with gains up to 30.2% citation precision across tasks and models is reported without statistical significance tests, error bars, or details on run-to-run variance. This leaves the reliability of the Active Indexing advantage only moderately supported, especially given the reader's note on the absence of these elements in the abstract and results.
Authors: We agree that reporting statistical significance tests, error bars, and run-to-run variance would strengthen the reliability of the empirical results. Our original experiments used single runs due to the high computational cost of continual pretraining for the Qwen-2.5-7B and 3B models. In the revised manuscript, we will conduct additional runs with varied random seeds for the main experiments, report means and standard deviations, and include statistical significance tests (such as paired t-tests) for the citation precision gains. We will also update the abstract and results sections to reflect these details.
Revision: yes
-
Referee: [Active Indexing and CitePretrainBench] Active Indexing and CitePretrainBench sections: The method's effectiveness rests on the assumption that synthetic augmentations (diverse compositional restatements and bidirectional pairs) produce bindings that transfer to natural query distributions. The benchmark mixes novel documents but does not isolate or test performance on queries whose syntactic and compositional patterns avoid those deliberately injected during augmentation, which is load-bearing for the generalization claim underlying the 30.2% gain.
Authors: We appreciate this observation on the generalization of the bindings to natural distributions. CitePretrainBench mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents, and the evaluation queries are drawn from this mixture to reflect realistic syntactic and compositional patterns. The performance improvements on both short-form and long-form tasks, along with the scaling trend up to 16x augmented data, provide supporting evidence for transfer. However, we acknowledge that the benchmark does not explicitly isolate queries with patterns fully disjoint from the augmentations. In the revised manuscript, we will add a dedicated discussion of this limitation and suggest it as future work, while moderating the strength of the generalization claims in the relevant sections.
Revision: partial
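The seeded reruns and paired t-test promised in the first response could be summarized as in this sketch. It uses only the Python standard library and pairs the Active and Passive runs by seed. The function name is hypothetical and the scores are made up for illustration; they are not results from the paper.

```python
import math
import statistics


def paired_t_statistic(active, passive):
    """Return (mean difference, sample std dev, t statistic, degrees of freedom)."""
    if len(active) != len(passive) or len(active) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - p for a, p in zip(active, passive)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = mean_diff / (std_diff / math.sqrt(len(diffs)))
    return mean_diff, std_diff, t_stat, len(diffs) - 1


# Made-up citation precision for three seeds; NOT results from the paper.
active_runs = [0.61, 0.58, 0.63]
passive_runs = [0.33, 0.31, 0.32]
mean_diff, std_diff, t_stat, dof = paired_t_statistic(active_runs, passive_runs)
print(f"mean gain={mean_diff:.3f}, sd={std_diff:.3f}, t={t_stat:.2f}, dof={dof}")
```

Turning the t statistic into a p-value needs the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` is the usual route when SciPy is available. With only a handful of seeds the test has little power, so reporting the per-seed differences alongside it is likely more informative.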
Circularity Check
No significant circularity in empirical training and evaluation chain
full rationale
The paper defines Active Indexing explicitly as a training augmentation strategy (diverse compositional restatements plus bidirectional source-to-fact and fact-to-source pairs) during continual pretraining, then measures citation precision against an independent Passive Indexing baseline on CitePretrainBench. Results, ablations on data scaling, and comparisons to external retrieval are reported as experimental outcomes rather than quantities derived by construction from fitted parameters or prior self-citations. No equations or uniqueness theorems are invoked that reduce the central claim to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- scale of augmented synthetic data
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Active Indexing... creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Measuring attribution in natural language generation models
Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840, 2023. URL: https://aclanthology.org/2023.cl-4.2, doi:10.1162/coli_a_00486
-
[2]
Survey on factuality in large language models: Knowledge, retrieval and domain-specificity
Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, et al. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity. ArXiv preprint, abs/2310.07521, 2023. URL: https://arxiv.org/abs/2310.07521
-
[3]
Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Hanchi Sun, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric P. Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Mano...
-
[4]
Extracting training data from large language models
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX security symposium (USENIX Security 21), pages 2633–2650, 2021
-
[5]
Ayush Agrawal, Mirac Suzgun, Lester Mackey, and Adam Kalai. Do language models know when they’re hallucinating references? In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics: EACL 2024, pages 912–928, St. Julian’s, Malta, 2024. Association for Computational Linguistics. URL: https://aclanthology.org/20...
-
[6]
Chatgpt hallucinates when attributing answers
Guido Zuccon, Bevan Koopman, and Razia Shaik. Chatgpt hallucinates when attributing answers. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 46–51, 2023
-
[7]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint ...
-
[8]
Teaching language models to support answers with verified quotes
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. ArXiv preprint, abs/2203.11147, 2022. URL: https://arxiv.org/abs/2203.11147
-
[9]
Rethinking with retrieval: Faithful large language model inference
Hangfeng He, Hongming Zhang, and Dan Roth. Rethinking with retrieval: Faithful large language model inference. ArXiv preprint, abs/2301.00303, 2023. URL: https://arxiv.org/abs/2301.00303
-
[10]
RARR: Researching and revising what language models say, using language models
Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. RARR: Researching and revising what language models say, using language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Associati...
-
[11]
URL: https://aclanthology.org/2023
Association for Computational Linguistics. URL: https://aclanthology.org/2023.acl-long.910, doi:10.18653/v1/2023.acl-long.910
-
[12]
How context affects language models’ factual predictions
Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. How context affects language models’ factual predictions. ArXiv preprint, abs/2005.04611, 2020. URL: https://arxiv.org/abs/2005.04611
-
[13]
Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,
-
[14]
URL: https://openreview.net/forum?id=auKAUJZMO6
OpenReview.net, 2024. URL: https://openreview.net/forum?id=auKAUJZMO6
-
[15]
Automatic evaluation of attribution by large language models
Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su, and Huan Sun. Automatic evaluation of attribution by large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4615–4635, Singapore, 2023. Association for Computational Linguistics. URL: https://aclanthology....
-
[16]
Enabling large language models to generate text with citations
Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488, Singapore, 2023. Association for Computational Linguistics. URL: https://aclant...
-
[17]
Source-aware training enables knowledge attribution in language models
Muhammad Khalifa, David Wadden, Emma Strubell, Honglak Lee, Lu Wang, Iz Beltagy, and Hao Peng. Source-aware training enables knowledge attribution in language models. In First Conference on Language Modeling, 2024. URL: https://openreview.net/forum?id=UPyWLwciYz
-
[18]
Evaluating verifiability in generative search engines
Nelson Liu, Tianyi Zhang, and Percy Liang. Evaluating verifiability in generative search engines. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7001–7025, Singapore, 2023. Association for Computational Linguistics. URL: https://aclanthology.org/2023.findings-emnlp.467, ...
-
[19]
Effective large language model adaptation for improved grounding and citation generation
Xi Ye, Ruoxi Sun, Sercan Arik, and Tomas Pfister. Effective large language model adaptation for improved grounding and citation generation. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long P...
-
[20]
Hagrid: A human-llm collaborative dataset for generative information-seeking with attribution
Ehsan Kamalloo, Aref Jafari, Xinyu Zhang, Nandan Thakur, and Jimmy Lin. Hagrid: A human-llm collaborative dataset for generative information-seeking with attribution. ArXiv preprint, abs/2307.16883, 2023. URL: https://arxiv.org/abs/2307.16883
-
[21]
Training language models to generate text with citations via fine-grained rewards
Chengyu Huang, Zeqiu Wu, Yushi Hu, and Wenya Wang. Training language models to generate text with citations via fine-grained rewards. ArXiv preprint, abs/2402.04315, 2024. URL: https://arxiv.org/abs/2402.04315
- [23]
-
[24]
Yukun Huang, Sanxing Chen, Hongyi Cai, and Bhuwan Dhingra. To trust or not to trust? enhancing large language models’ situated faithfulness to external contexts. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=K2jOacHUlO
-
[25]
Recitation-augmented language models
Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL: https://openreview.net/pdf?id=-cqvvvb-NkI
-
[26]
Orion Weller, Marc Marone, Nathaniel Weir, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. “according to . . . ”: Prompting language models improves quoting from pre-training data. In Yvette Graham and Matthew Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long ...
-
[27]
Generative retrieval with large language models
Ye Wang, Xinrun Xu, Rui Xie, Wenxin Hu, and Wei Ye. Generative retrieval with large language models. ArXiv preprint, abs/2402.17010, 2024. URL: https://arxiv.org/abs/2402.17010
-
[28]
Verifiable by design: Aligning language models to quote from pre-training data
Jingyu Zhang, Marc Marone, Tianjian Li, Benjamin Van Durme, and Daniel Khashabi. Verifiable by design: Aligning language models to quote from pre-training data. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Techn...
-
[29]
ASQA: Factoid questions meet long-form answers
Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. U...
-
[30]
KILT: a benchmark for knowledge intensive language tasks
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Belt...
-
[31]
ELI5: Long form question answering
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy, 2019. Association for Computational Linguistics. ...
-
[32]
CCNet: Extracting high quality monolingual datasets from web crawl data
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Jos...
-
[33]
Yuwei Wan, Yixuan Liu, Aswathy Ajith, Clara Grazian, Bram Hoex, Wenjie Zhang, Chunyu Kit, Tong Xie, and Ian Foster. Sciqag: A framework for auto-generated science question answering dataset with fine-grained evaluation. ArXiv preprint, abs/2405.09939, 2024. URL: https://arxiv.org/abs/2405.09939
-
[34]
Repliqa: A question-answering dataset for benchmarking llms on unseen reference content
João Monteiro, Pierre-André Noël, Étienne Marcotte, Sai Rajeswar Mudumba, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Chris Pal, and Perouz Taslakian. Repliqa: A question-answering dataset for benchmarking llms on unseen reference content. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Ch...
-
[35]
FreshLLMs: Refreshing large language models with search engine augmentation
Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. FreshLLMs: Refreshing large language models with search engine augmentation. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 13697–...
-
[36]
TrueTeacher: Learning factual consistency evaluation with large language models
Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, and Idan Szpektor. TrueTeacher: Learning factual consistency evaluation with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2053–2070, Singapore, 2023. Association for Co...
-
[37]
Physics of language models: Part 3.2, knowledge manipulation
Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipulation. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=oDbiL9CLoS
-
[38]
Synthetic continued pretraining
Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candes, and Tatsunori Hashimoto. Synthetic continued pretraining. In The Thirteenth International Conference on Learning Representations,
-
[39]
URL: https://openreview.net/forum?id=07yvxWDSla
-
[40]
Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, and William Yang Wang. Generalization v.s. memorization: Tracing language models’ capabilities back to pretraining data. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=IQxBDLmVpT
-
[41]
The web is your oyster - knowledge-intensive nlp against a very large web corpus
Aleksandra Piktus, Fabio Petroni, Yizhong Wang, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Oğuz, Edouard Grave, Wen-tau Yih, and Sebastian Riedel. The web is your oyster - knowledge-intensive nlp against a very large web corpus. ArXiv preprint, abs/2112.09924, 2021. URL: https://arxiv.org/abs/2112.09924
-
[42]
Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Prakash Gupta, Tal Schuster, William W. Cohen, and Donald Metzler. Transformer memory as a differentiable search index. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Proces...
-
[43]
From matching to generation: A survey on generative information retrieval
Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yuyao Zhang, Peitian Zhang, Yutao Zhu, and Zhicheng Dou. From matching to generation: A survey on generative information retrieval. ACM Transactions on Information Systems, 2024. URL: https://api.semanticscholar.org/CorpusID:269303210
-
[44]
Corpuslm: Towards a unified language model on corpus for knowledge-intensive tasks
Xiaoxi Li, Zhicheng Dou, Yujia Zhou, and Fangchao Liu. Corpuslm: Towards a unified language model on corpus for knowledge-intensive tasks. In Grace Hui Yang, Hongning Wang, Sam Han, Claudia Hauff, Guido Zuccon, and Yi Zhang, editors, Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024...
-
[45]
TRAK: attributing model behavior at scale
Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. TRAK: attributing model behavior at scale. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 o...
-
[46]
Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, and Ian Tenney
Tyler A. Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, and Ian Tenney. Scalable influence and fact tracing for large language model pretraining. In The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=gLa96FlWwn
-
[47]
Tong Chen, Akari Asai, Niloofar Mireshghallah, Sewon Min, James Grimmelmann, Yejin Choi, Hanna Hajishirzi, Luke S. Zettlemoyer, and Pang Wei Koh. Copybench: Measuring literal and non-literal reproduction of copyright-protected text in language model generation. ArXiv preprint, abs/2407.07087, 2024. URL: https://arxiv.org/abs/2407.07087
-
[48]
In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval
Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Anna Rogers, Iacer Calixto, Ivan Vulić, Naomi Saphra, Nora Kassner, Oana-Maria Camburu, Trapit Bansal, and Vered Shwartz, editors, Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4N...
-
[49]
The model is directly prompted to rank text titles given document content
Natural Title (Raw): A baseline without continual pretraining. The model is directly prompted to rank text titles given document content. This tests whether pretrained LMs can match content to titles without exposure
-
[50]
This approach uses natural-language identifiers that align with the model’s training distribution
Natural Titles : We perform continual pretraining where each document is appended with its human-written text title. This approach uses natural-language identifiers that align with the model’s training distribution
-
[51]
Documents are embedded and clustered using K-means into 10 top-level groups
Hierarchical K-Means Integer (HKM-Integer): Instead of using random integers, we construct semantically structured integer IDs following [38]. Documents are embedded and clustered using K-means into 10 top-level groups. Each group is assigned a prefix digit. The process is recursively applied within each cluster, with each level adding a digit to the ID. ...
-
[52]
Hierarchical LDA with Keyword Labels (HLDA-Keywords) We apply hierarchical topic modeling (LDA) to recursively cluster documents. For each cluster, we use an LLM to generate a representative keyword based on its most salient documents. The final identifier is a concatenation of these keywords along the cluster path, forming a semantic, hierarchical label....
-
[53]
Keyword-First Domain Identifier (Keywords→Domain) Similar to the above, but constructed in a bottom-up manner. Keywords appear first, followed by the broader domain label (e.g., entropy- energy-physics), emphasizing specificity before generality. ID Type Acc@1 Acc@10 Natural Titles (Raw) 9.7 46.3 Natural Titles 53.3 75.3 HKM-Integer 2.0 21.7 HLDA-Keywords...
-
[54]
Correct Answer with Faithful and Diverse Citations In ideal cases, the model not only produces a factually accurate and coherent answer, but also cites multiple distinct documents, each supporting a different part of the response. This indicates that the model is capable of recalling parametric knowledge and integrating diverse sources during generation. ...
-
[55]
This suggests a mismatch between content planning and citation generation
Correct Answer but Incorrect Citations In some cases, the generated answer is factually correct and well-structured, but the cited documents are irrelevant. This suggests a mismatch between content planning and citation generation. Example: Question: Why do online communities crumble as they gain popularity? Model Answer: Communities may lose cohesion as ...
-
[56]
Faithful Citations but Incomplete Answer Sometimes, the model successfully grounds all claims in real documents, but the final answer fails to directly address the question. Example: Question: Why do so many drugs cause “dry mouth” as a side effect? Model Answer: Many drugs cause xerostomia, or dry mouth. <|Understanding Medication Side Effects: The Pr...
-
[57]
“Title Lure” Errors in Short-form QA In short-form QA tasks, the model sometimes selects citations solely based on title relevance, even when the document content lacks the required evidence. This reflects a superficial attribute mechanism. Example: Question: How is Boston addressing the digital divide in terms of communications technology from December 2...
-
[58]
Cross-Domain Lookalikes and “Near Miss” Citations Occasionally, the model cites from a mismatched domain—e.g., a general Wikipedia article instead of a domain-specific source like RepliQA—producing citations that superficially resemble the ground truth but lack factual alignment. Example: Question: When was the last game of Copenhagen’s basketball season ...