Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
Pith reviewed 2026-05-22 19:29 UTC · model grok-4.3
The pith
Generating synthetic hard negatives with LLMs for dense retrieval training underperforms traditional corpus-based mining with BM25 and cross-encoders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training effective dense retrieval models typically relies on hard negative examples mined from large document corpora using methods such as BM25 or cross-encoders, which require full corpus access and expensive index construction. We propose generating synthetic hard negatives directly from a provided query and positive passage, using Large Language Models. We fine-tune DistilBERT using synthetic negatives generated by four state-of-the-art LLMs ranging from 4B to 30B parameters and evaluate performance across 10 BEIR benchmark datasets. Contrary to the prevailing assumption that stronger generative models yield better synthetic data, the generative pipeline consistently underperforms the 0
What carries the argument
The generative pipeline that prompts an LLM to produce synthetic hard negatives given only a query and a positive passage, then uses those negatives to train a dense retriever without ever accessing the full corpus.
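The corpus-free pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's code: the prompt wording in `PROMPT_TEMPLATE` is hypothetical (the actual template is not reproduced on this page), and `Triplet`, `build_prompt`, `make_triplet`, and the stand-in generator `fake_llm` are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt wording; the paper's real template is not shown here.
PROMPT_TEMPLATE = (
    "Query: {query}\n"
    "Relevant passage: {positive}\n"
    "Write one passage that looks topically related to the query "
    "but does not answer it."
)


@dataclass(frozen=True)
class Triplet:
    """One training example for a dense retriever."""
    query: str
    positive: str
    negative: str


def build_prompt(query: str, positive: str) -> str:
    """Fill the template using only the query and its positive passage."""
    return PROMPT_TEMPLATE.format(query=query, positive=positive)


def make_triplet(query: str, positive: str,
                 generate: Callable[[str], str]) -> Triplet:
    """Ask the generator for one synthetic negative; no corpus is touched."""
    negative = generate(build_prompt(query, positive)).strip()
    return Triplet(query, positive, negative)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "  The Eiffel Tower is repainted roughly every seven years.  "


triplet = make_triplet(
    "How tall is the Eiffel Tower?",
    "The Eiffel Tower is about 330 metres tall.",
    fake_llm,
)
print(triplet.negative)
```

Passing the generator in as a callable keeps the triplet-building logic independent of whichever model or serving stack produces the text.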
If this is right
- Dense retrieval training still benefits from direct access to the full document collection for negative mining rather than relying solely on generative synthesis.
- Increasing the parameter count of the negative generator does not reliably produce harder or more useful training examples for the retriever.
- Corpus-based mining strategies remain the stronger default for creating effective training data on standard retrieval benchmarks.
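For contrast with the corpus-free route, corpus-based mining with BM25 can be sketched as below. This is a small self-contained illustration, not the paper's mining code: the whitespace tokenizer, the `k1` and `b` values, the Lucene-style non-negative IDF, the function names, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for the sketch)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in the corpus against the query with BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def mine_hard_negatives(query, docs, positive_ids, top_k=1):
    """Return the highest-ranked documents that are not labelled positive."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i not in positive_ids][:top_k]


docs = [
    "the eiffel tower is 330 metres tall",                    # labelled positive
    "the eiffel tower was built in 1889 for the world fair",
    "bananas are rich in potassium",
    "the tower bridge crosses the thames",
]
print(mine_hard_negatives("how tall is the eiffel tower", docs, {0}, top_k=2))
```

The key difference from prompting is visible in `mine_hard_negatives`: the negatives are selected by ranking the whole collection, so their difficulty is anchored to what a retriever would actually confuse with the positive.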
Where Pith is reading between the lines
- Hybrid approaches that combine limited corpus mining with LLM generation might close the observed performance gap without requiring full index construction.
- The non-monotonic scaling pattern suggests that prompt quality or output calibration may matter more than raw generator size for this use case.
- Future benchmarks could explicitly measure how well synthetic negatives approximate the distribution of corpus-mined negatives.
Load-bearing premise
The synthetic hard negatives generated by the LLMs from a provided query and positive passage are of comparable quality and hardness to negatives mined from the full corpus.
What would settle it
A direct experiment in which the same DistilBERT model trained on the LLM-generated negatives reaches or exceeds the nDCG@10 scores obtained by the BM25-mined baseline on a majority of the BEIR datasets.
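The settling criterion above can be made concrete with a short sketch. It assumes linear gain in the DCG sum (some implementations use an exponential gain instead), and the helper names and placeholder dataset keys are illustrative; no scores from the paper are reproduced.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query; qrels maps doc id -> graded relevance."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def wins_majority(candidate, baseline):
    """True if the candidate matches or beats the baseline on most datasets."""
    wins = sum(candidate[name] >= baseline[name] for name in baseline)
    return wins > len(baseline) / 2


# Placeholder per-dataset scores, made up purely to exercise the function.
llm_negatives = {"dataset_a": 0.30, "dataset_b": 0.20, "dataset_c": 0.50}
bm25_negatives = {"dataset_a": 0.20, "dataset_b": 0.30, "dataset_c": 0.40}
print(wins_majority(llm_negatives, bm25_negatives))
```

In practice the per-dataset values would be the mean nDCG@10 over all test queries of each BEIR dataset, computed for both training conditions with the same retriever and hyperparameters.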
Original abstract
Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora using methods such as BM25 or cross-encoders, which require full corpus access and expensive index construction. We propose generating synthetic hard negatives directly from a provided query and positive passage, using Large Language Models (LLMs). We fine-tune DistilBERT using synthetic negatives generated by four state-of-the-art LLMs ranging from 4B to 30B parameters (Qwen3, LLaMA3, Phi4) and evaluate performance across 10 BEIR benchmark datasets. Contrary to the prevailing assumption that stronger generative models yield better synthetic data, find that our generative pipeline consistently underperforms traditional corpus-based mining strategies (BM25 and Cross-Encoder). Furthermore, we observe that scaling the generator model does not monotonically improve retrieval performance and find that the 14B parameter model outperforms the 30B model and in some settings it is the worst performing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes generating synthetic hard negatives for training dense retrievers directly from a query and positive passage using LLMs (Qwen3, LLaMA3, and Phi4, ranging from 4B to 30B parameters) instead of mining them from a corpus with BM25 or cross-encoders. The authors fine-tune DistilBERT on the resulting data and evaluate on 10 BEIR datasets, reporting consistent underperformance relative to traditional mining and non-monotonic scaling, with the 14B model sometimes outperforming the 30B model.
Significance. If the central empirical comparison holds after addressing controls for negative hardness and experimental details, the work would be significant for information retrieval by challenging the assumption that stronger LLMs automatically produce superior synthetic training data and by underscoring the role of corpus-based hardness in effective dense retrieval training. The multi-dataset evaluation on BEIR provides a useful empirical baseline for future generative approaches.
major comments (2)
- [Method] Method section (generative pipeline description): The synthetic negatives are produced solely from the query and positive passage via LLM prompting, without corpus access or explicit ranking to select top distractors. This setup does not replicate the hardness of BM25 or cross-encoder mined negatives (which are the highest-ranked false positives from the full collection), so the reported underperformance on BEIR may reflect easier training signals rather than a fundamental limitation of generation; this assumption is load-bearing for the abstract's central claim.
- [Experiments] Experiments and Results sections: The non-monotonic scaling observation (14B outperforming 30B in some settings) and overall underperformance lack reported details on prompting templates per model, number of generations per query-positive pair, variance across runs, or statistical testing. Without these, it is unclear whether the findings are robust or influenced by generation stochasticity or prompt adherence differences.
minor comments (2)
- [Abstract] Abstract: The sentence 'find that our generative pipeline consistently underperforms' is missing the subject 'we'.
- [Results] Consider adding an analysis or table quantifying negative hardness (e.g., via embedding similarity to positives or retrieval rank in a held-out index) to support the comparison.
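One way to read the hardness analysis suggested in the last comment is as an average similarity between each positive passage and its paired negative. The sketch below assumes embeddings are already available as plain lists of floats; the function names and toy vectors are illustrative and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_hardness(pairs):
    """Average positive-negative similarity; higher suggests harder negatives.

    pairs: list of (positive_embedding, negative_embedding) tuples.
    """
    sims = [cosine(pos, neg) for pos, neg in pairs]
    return sum(sims) / len(sims)


# Toy 2-d embeddings, made up to show the comparison only.
mined_pairs = [([1.0, 0.0], [0.8, 0.6])]
synthetic_pairs = [([1.0, 0.0], [0.6, 0.8])]
print(mean_hardness(mined_pairs), mean_hardness(synthetic_pairs))
```

Reporting this number for BM25-mined, cross-encoder-mined, and LLM-generated negatives side by side would show whether the performance gap tracks a measurable hardness gap. Similarity to the positive is only a proxy, since a negative that is too similar may actually be an unlabelled positive.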
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below with clarifications on our design choices and commitments to strengthen the manuscript.
- Referee: [Method] Method section (generative pipeline description): The synthetic negatives are produced solely from the query and positive passage via LLM prompting, without corpus access or explicit ranking to select top distractors. This setup does not replicate the hardness of BM25 or cross-encoder mined negatives (which are the highest-ranked false positives from the full collection), so the reported underperformance on BEIR may reflect easier training signals rather than a fundamental limitation of generation; this assumption is load-bearing for the abstract's central claim.
  Authors: We agree that our approach generates negatives without corpus access or ranking, which differs from BM25 or cross-encoder mining. This is an intentional design to evaluate a corpus-free alternative that avoids index construction. The consistent underperformance relative to mined negatives demonstrates the difficulty of eliciting sufficiently hard negatives via prompting alone, supporting rather than undermining the central claim. We will revise the abstract and method section to explicitly frame the work as assessing corpus-free generation and to discuss the implications of this hardness gap.
  Revision: partial
- Referee: [Experiments] Experiments and Results sections: The non-monotonic scaling observation (14B outperforming 30B in some settings) and overall underperformance lack reported details on prompting templates per model, number of generations per query-positive pair, variance across runs, or statistical testing. Without these, it is unclear whether the findings are robust or influenced by generation stochasticity or prompt adherence differences.
  Authors: We will add the requested details in the revised manuscript. This includes the exact prompting templates for each model (Qwen3, LLaMA3, Phi4), confirmation that one negative was generated per query-positive pair, standard deviations across multiple runs with different seeds, and statistical significance tests (e.g., paired t-tests) for key performance differences. These additions will address concerns about stochasticity and robustness.
  Revision: yes
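The robustness checks promised in the rebuttal (standard deviation across seeds and a paired t-test) can be sketched with the standard library alone. The function names and the numbers below are made up for illustration; they are not results from the paper.

```python
import math
import statistics


def seed_summary(scores_per_seed):
    """Mean and sample standard deviation of one metric across seeds."""
    return statistics.mean(scores_per_seed), statistics.stdev(scores_per_seed)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched scores.

    scores_a[i] and scores_b[i] must refer to the same dataset (or query).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up matched scores for two training conditions on four datasets.
condition_a = [3.0, 5.0, 4.0, 8.0]
condition_b = [1.0, 0.0, 3.0, 4.0]
t_stat, dof = paired_t_statistic(condition_a, condition_b)
print(round(t_stat, 4), dof)
```

Turning the statistic into a p-value needs the t-distribution CDF; `scipy.stats.ttest_rel` computes both in one call if SciPy is available. With only 10 BEIR datasets, a non-parametric alternative such as the Wilcoxon signed-rank test may also be worth reporting.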
Circularity Check
No circularity: empirical evaluation on external benchmarks
The paper's central claims rest on a direct experimental pipeline: LLM-prompted generation of synthetic hard negatives from query+positive pairs, followed by fine-tuning DistilBERT and measuring nDCG@10 on the external BEIR benchmark suite. Performance is compared head-to-head against BM25 and cross-encoder negatives mined from the full corpus. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the reported method or results. All outcomes are falsifiable measurements against independent test collections rather than reductions to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: BEIR datasets provide a representative and standard evaluation of dense retrieval performance across domains.
Appendix A: LLM configuration and inference
The models were loaded using vllm (Kwon et al., 2023) for efficient inference, and all models were configured as follows:
- Sampling parameters:
  - Temperature: 0.6
  - Top-p: 0.95
  - Top-k: 20
  - Minimum p: 0.0
  - Maximum tokens: 1024
- Tensor parallel size: 2 & 6 (for the Qwen3-30B model)
- dtype: float32
- GPU memory utilization: 0.80
Appendix B: Prompts
B.0.1 USERPROM...
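The generation settings listed in the appendix fragment above could be expressed as keyword arguments for vLLM along the following lines. This is a hedged sketch rather than the authors' script: the dictionary and function names are illustrative, the model name is left to the caller, and `tensor_parallel_size` is set to 2 because the source gives "2 & 6 (for the Qwen3-30B model)" without saying unambiguously which value applies where.

```python
# Values copied from the configuration list above; the keys follow the
# keyword names of vLLM's SamplingParams and LLM classes.
SAMPLING_KWARGS = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "max_tokens": 1024,
}

ENGINE_KWARGS = {
    "tensor_parallel_size": 2,       # assumption: see the note in the lead-in
    "dtype": "float32",
    "gpu_memory_utilization": 0.80,
}


def load_and_generate(model_name, prompts):
    """Generate one completion per prompt with the settings above.

    vLLM is imported lazily so the configuration can be inspected on a
    machine without GPUs or without vLLM installed.
    """
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name, **ENGINE_KWARGS)
    outputs = llm.generate(prompts, SamplingParams(**SAMPLING_KWARGS))
    return [out.outputs[0].text for out in outputs]


print(SAMPLING_KWARGS, ENGINE_KWARGS)
```

Because sampling uses a non-zero temperature, repeated runs will produce different negatives, which is one reason the referee's request for variance across seeds matters.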