Generative Retrieval Overcomes Limitations of Dense Retrieval but Struggles with Identifier Ambiguity
Pith reviewed 2026-05-10 18:56 UTC · model grok-4.3
The pith
Generative retrieval outperforms dense and sparse baselines on the LIMIT dataset but degrades when hard negatives introduce identifier ambiguity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generative retrieval achieves the best performance on this dataset without any additional training (0.92 and 0.99 R@2 for SEAL and MINDER, respectively), compared to dense approaches (< 0.03 R@2) and BM25 (0.86 R@2). However, extending the original LIMIT dataset with simple hard negative samples degrades performance for all models, including the generative retrieval models (0.51 R@2) and BM25 (0.21 R@2). Error analysis identifies a failure in the decoding mechanism, caused by the inability to produce identifiers that are unique to relevant documents.
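For concreteness, Recall@2 is the fraction of each query's relevant documents that appear in the model's top two results, averaged over queries. A minimal sketch in Python (the data structures here are illustrative, not the paper's evaluation harness):

```python
def recall_at_k(ranked_ids, relevant_ids, k=2):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

# Illustrative example: a LIMIT-style query with two relevant documents.
ranked = ["doc_17", "doc_04", "doc_99"]    # model output, best first
relevant = {"doc_17", "doc_42"}
print(recall_at_k(ranked, relevant))       # 0.5: one of two relevant docs in top 2
```

Averaging this score over all queries yields figures such as the reported 0.92 R@2.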
What carries the argument
Two components: the LIMIT dataset together with its hard-negative extension, which isolates identifier ambiguity by forcing models to retrieve documents whose identifiers are not uniquely determined by the query; and the autoregressive decoder, which must emit exact document identifiers rather than rank documents by embedding similarity.
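SEAL constrains decoding with an FM-index over corpus substrings and MINDER adds multiple identifier views, but the shared mechanism is that each decoding step may only emit tokens that still extend some valid identifier. A minimal trie-based sketch of that constraint (a simplification of both systems; the token IDs and helper names are illustrative):

```python
from collections import defaultdict

def build_trie(identifiers):
    """Map each identifier-token prefix to the set of tokens that may follow it."""
    trie = defaultdict(set)
    for ident in identifiers:              # each identifier: a tuple of token IDs
        for i in range(len(ident)):
            trie[ident[:i]].add(ident[i])
    return trie

def allowed_next(trie, prefix):
    """Tokens a constrained decoder may emit after `prefix`; empty set = stop."""
    return trie.get(tuple(prefix), set())

# Identifiers sharing a long prefix -- exactly the ambiguity the paper probes.
ids = [(5, 9, 2, 7), (5, 9, 2, 8), (5, 3, 1, 4)]
trie = build_trie(ids)
print(allowed_next(trie, (5, 9, 2)))       # {7, 8}: two documents still compatible
```

When several documents share a prefix, beam search must eventually commit to one suffix, which is where the paper locates the failure.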
Load-bearing premise
That the synthetic LIMIT dataset and its hard-negative extension sufficiently capture the identifier ambiguity that would appear in real-world corpora, and that the observed decoding failure, rather than other model or training factors, is the primary cause of the degradation.
What would settle it
Run the same generative models on a real-world corpus containing known overlapping document identifiers and measure whether Recall@2 drops by a comparable margin when hard negatives are added.
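A sketch of that experiment, assuming a hypothetical `retrieve(query, corpus)` callable wrapping any of the evaluated models, plus the `recall_at_k` helper sketched above (all names are assumptions, not the paper's code):

```python
def mean_recall(queries, qrels, corpus, retrieve, k=2):
    """Average Recall@k of `retrieve` over all queries on a given corpus."""
    scores = [recall_at_k(retrieve(q, corpus), qrels[q], k) for q in queries]
    return sum(scores) / len(scores)

def hard_negative_drop(queries, qrels, corpus, negatives, retrieve, k=2):
    """Recall@k lost when hard negatives are added to the corpus."""
    return (mean_recall(queries, qrels, corpus, retrieve, k)
            - mean_recall(queries, qrels, corpus + negatives, retrieve, k))

# A drop comparable to LIMIT's (roughly 0.4 R@2 for the generative models)
# would support transferring the synthetic finding to real corpora.
```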
Original abstract
While dense retrieval models, which embed queries and documents into a shared low-dimensional space, have gained widespread popularity, they were shown to exhibit important theoretical limitations and considerably lag behind traditional sparse retrieval models in certain settings. Generative retrieval has emerged as an alternative approach to dense retrieval by using a language model to predict query-document relevance directly. In this paper, we demonstrate strengths and weaknesses of generative retrieval approaches using a simple synthetic dataset, called LIMIT, that was previously introduced to empirically demonstrate the theoretical limitations of embedding-based retrieval but was not used to evaluate generative retrieval. We close this research gap and show that generative retrieval achieves the best performance on this dataset without any additional training required (0.92 and 0.99 R@2 for SEAL and MINDER, respectively), compared to dense approaches (< 0.03 Recall@2) and BM25 (0.86 R@2). However, we then proceed to extend the original LIMIT dataset by adding simple hard negative samples and observe the performance degrading for all the models including the generative retrieval models (0.51 R@2) as well as BM25 (0.21 R@2). Error analysis identifies a failure in the decoding mechanism, caused by the inability to produce identifiers that are unique to relevant documents. Future generative retrieval must address these issues, either by designing identifiers that are more suitable to the decoding process or by adapting decoding and scoring algorithms to preserve relevance signals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates generative retrieval models SEAL and MINDER on the synthetic LIMIT dataset previously used to demonstrate limitations of dense retrieval. It reports that these models achieve high Recall@2 (0.92 and 0.99) without additional training, outperforming dense retrieval (<0.03) and BM25 (0.86). Extending LIMIT with hard negative samples causes performance to drop for all methods (generative models to 0.51 R@2, BM25 to 0.21 R@2). Error analysis attributes the generative models' degradation to a decoding failure arising from inability to produce identifiers unique to relevant documents, and recommends future work on identifier design or adapted decoding/scoring.
Significance. If the attribution of the performance drop specifically to identifier ambiguity in decoding is correct, the result identifies a concrete practical limitation of current generative retrieval that is not shared to the same degree by BM25 on this data. The use of the existing LIMIT dataset to test generative retrieval is a useful contribution, and the zero-training performance numbers provide a clear baseline. However, the synthetic construction and the lack of verification that the models' internal relevance signals remain intact limit the strength of the conclusions for real corpora.
Major comments (2)
- The error analysis claims the R@2 drop from 0.92/0.99 to 0.51 is caused by 'failure in the decoding mechanism' due to 'inability to produce identifiers that are unique to relevant documents.' No supporting measurements are described (per-token probabilities on correct vs. hard-negative identifiers, oracle-ID scoring, or constrained-decoding ablation) to show that the model still assigns higher probability to the correct identifier when hard negatives are present. Without such evidence the degradation could instead reflect failure to model relevance at all once negatives are introduced, undermining the specific claim about identifier ambiguity.
- The hard-negative extension of LIMIT is described only as 'adding simple hard negative samples.' No details are given on how the negatives are sampled, whether they preserve the original theoretical limitations of the dataset, or how many negatives are added per query. This construction is load-bearing for the central claim that generative retrieval 'struggles with identifier ambiguity' rather than with the introduction of any hard negatives.
Minor comments (2)
- The abstract and results report single-point Recall@2 numbers without error bars, multiple random seeds, or statistical significance tests; adding these would strengthen the comparison between 0.51 and 0.21.
- It is unclear from the provided description whether the dense retrieval baselines were re-trained or used off-the-shelf on the extended dataset; explicit statement of the experimental protocol for all methods would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the evidence and documentation as appropriate.
Point-by-point responses
Referee: The error analysis claims the R@2 drop from 0.92/0.99 to 0.51 is caused by 'failure in the decoding mechanism' due to 'inability to produce identifiers that are unique to relevant documents.' No supporting measurements are described (per-token probabilities on correct vs. hard-negative identifiers, oracle-ID scoring, or constrained-decoding ablation) to show that the model still assigns higher probability to the correct identifier when hard negatives are present. Without such evidence the degradation could instead reflect failure to model relevance at all once negatives are introduced, undermining the specific claim about identifier ambiguity.
Authors: We acknowledge that the original manuscript's error analysis was primarily qualitative, based on inspection of generated identifier sequences that frequently matched hard negatives. To directly address this, we have added quantitative measurements: per-token log-probabilities comparing the correct identifier against hard-negative identifiers, an oracle-ID scoring experiment (forcing generation of the ground-truth identifier and measuring relevance capture), and a constrained-decoding ablation. These results show the model assigns higher probability to correct identifiers even with hard negatives present, but beam decoding fails due to prefix ambiguity. The revised paper includes these in a new subsection of the error analysis. revision: yes
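A minimal sketch of the per-token log-probability comparison the authors describe, using an off-the-shelf BART checkpoint from Hugging Face transformers (the actual systems are fine-tuned SEAL/MINDER models with constrained decoding; the checkpoint, query, and identifier strings here are illustrative assumptions):

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").eval()

@torch.no_grad()
def identifier_logprob(query, identifier):
    """Sum of per-token log-probabilities of `identifier` given `query`."""
    enc = tok(query, return_tensors="pt")
    labels = tok(identifier, return_tensors="pt").input_ids
    logits = model(**enc, labels=labels).logits          # (1, len, vocab)
    logp = logits.log_softmax(-1)
    return logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1).sum().item()

# Diagnostic: with hard negatives present, does the model still score the
# correct identifier above a prefix-sharing negative? (strings illustrative)
gold = identifier_logprob("who likes apples?", "alice likes apples daily")
neg = identifier_logprob("who likes apples?", "alice likes apples rarely")
print(gold > neg)   # True would support the prefix-ambiguity explanation
```

If the gold identifier wins on total log-probability but loses under beam search, the failure sits in decoding rather than in relevance modeling, which is exactly the distinction the referee asked for.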
Referee: The hard-negative extension of LIMIT is described only as 'adding simple hard negative samples.' No details are given on how the negatives are sampled, whether they preserve the original theoretical limitations of the dataset, or how many negatives are added per query. This construction is load-bearing for the central claim that generative retrieval 'struggles with identifier ambiguity' rather than with the introduction of any hard negatives.
Authors: We agree the description was insufficient and have expanded it substantially. Hard negatives were sampled by selecting non-relevant documents whose identifiers share the first three tokens with the relevant document's identifier (creating prefix ambiguity) while differing in the suffix; we added three such negatives per query. We explicitly verified that the original LIMIT properties are preserved, including that dense retrieval Recall@2 remains below 0.05. The revised manuscript now includes a dedicated paragraph in the Dataset section with the sampling procedure, exact counts, and preservation checks. revision: yes
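A sketch of the sampling rule as described in the response, with identifiers as token tuples (the helper and corpus structure are illustrative; details beyond those quoted here may differ in the revised paper):

```python
import random

def sample_hard_negatives(relevant_id, candidate_ids, n=3, prefix_len=3):
    """Pick up to n non-relevant identifiers sharing the first prefix_len
    tokens with the relevant identifier but differing in the suffix."""
    prefix = relevant_id[:prefix_len]
    pool = [c for c in candidate_ids
            if c[:prefix_len] == prefix and c != relevant_id]
    return random.sample(pool, min(n, len(pool)))

# Three prefix-sharing negatives per query, as stated in the rebuttal.
relevant = ("alice", "likes", "apples", "daily")
candidates = [("alice", "likes", "apples", "rarely"),
              ("alice", "likes", "apples", "never"),
              ("alice", "likes", "apples", "often"),
              ("bob", "likes", "pears", "daily")]
print(sample_hard_negatives(relevant, candidates))
```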
Circularity Check
No circularity: purely empirical model comparison on synthetic dataset
Full rationale
The paper reports direct experimental results comparing off-the-shelf generative retrieval models (SEAL, MINDER) to dense retrievers and BM25 on the existing LIMIT dataset and a manually extended version with hard negatives. All reported numbers (Recall@2 values, performance drops) are measured outcomes from running the models; no parameters are fitted to the target metric, no equations derive one quantity from another by construction, and no self-citations supply load-bearing uniqueness theorems or ansatzes. The error analysis is an interpretive post-hoc examination of decoding failures rather than a deductive chain that reduces the central claim to its own inputs. The work is therefore self-contained empirical evaluation with no reduction of predictions to fitted inputs or self-referential definitions.
Reference graph
Works this paper leans on
- [1] Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Scott Yih, Sebastian Riedel, and Fabio Petroni. 2022. Autoregressive search engines: Generating substrings as document identifiers. Advances in Neural Information Processing Systems 35 (2022), 31668–31683.
- [2] Mohsen Fayyaz, Ali Modarressi, Hinrich Schütze, and Nanyun Peng. 2025. Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.
- [3] P. Ferragina and G. Manzini. 2000. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science, 390–398. doi:10.1109/SFCS.2000.892127.
- [4] Mitko Gospodinov, Sean MacAvaney, and Craig Macdonald. 2023. Doc2Query–: when less is more. In European Conference on Information Retrieval. Springer, 414–422.
- [5] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022. Unsupervised Dense Information Retrieval with Contrastive Learning. arXiv:2112.09118 [cs.IR]. https://arxiv.org/abs/2112.09118
- [6] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In EMNLP (1), 6769–6781.
- [7] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
- [8] Yongqi Li, Nan Yang, Liang Wang, Furu Wei, and Wenjie Li. 2023. Multiview Identifiers Enhanced Generative Retrieval. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 6636–6648.
- [9] Yongqi Li, Nan Yang, Liang Wang, Furu Wei, and Wenjie Li. 2024. Learning to Rank in Generative Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 8716–8723. https://doi.org/10.1609/aaai.v38i8.28717
- [10] Yongqi Li, Ruqing Zhang, Jiafeng Guo, et al. 2024. Generative Retrieval as Multi-Vector Dense Retrieval. In Proceedings of the 47th International ACM SIGIR Conference. Provides theoretical grounding for GR's high representational capacity compared to DR.
- [11] Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald. 2021. Zero-shot neural passage retrieval via domain-targeted synthetic question generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 1075–1088.
- [12] Kidist Amde Mekonnen, Yubao Tang, and Maarten de Rijke. 2025. Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25). https://arxiv.org/abs/2504.05181. Introduces direct pairwise ranking ...
- [13] Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery. Online preprint 6, 2 (2019).
- [14] Nils Reimers and Iryna Gurevych. 2021. The curse of dense low-dimensional information retrieval for large index sizes. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 605–611.
- [15] Weiwei Sun, Keyi Kong, Xinyu Ma, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Zhaochun Ren, and Yiming Yang. 2025. ZeroGR: A Generalizable and Scalable Framework for Zero-Shot Generative Retrieval. arXiv preprint arXiv:2510.10419 (2025).
- [16] Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, et al. 2022. Transformer memory as a differentiable search index. Advances in Neural Information Processing Systems 35 (2022), 21831–21843.
- [17] Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, et al. 2022. A neural corpus indexer for document retrieval. Advances in Neural Information Processing Systems 35 (2022), 25600–25614.
- [19] Orion Weller, Benjamin Van Durme, Dawn Lawrie, Ashwin Paranjape, Yuhao Zhang, and Jack Hessel. [n. d.]. Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models. In The Thirteenth International Conference on Learning Representations.
- [20] Hansi Zeng, Chen Luo, Bowen Jin, Sheikh Muhammad Sarwar, Tianxin Wei, and Hamed Zamani. 2024. Scalable and effective generative information retrieval. In Proceedings of the ACM Web Conference 2024, 1441–1452.
- [21] Hansi Zeng, Chen Luo, and Hamed Zamani. 2024. Planning ahead in generative retrieval: Guiding autoregressive generation through simultaneous decoding. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 469–480.
- [23] Zhen Zhang, Xinyu Ma, Weiwei Sun, Pengjie Ren, Zhumin Chen, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, and Zhaochun Ren. 2025. Replication and Exploration of Generative Retrieval over Dynamic Corpora. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3325–3334.