arxiv: 2504.21015 · v4 · submitted 2025-04-20 · 💻 cs.IR · cs.CL

Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval

Pith reviewed 2026-05-22 19:29 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords dense retrieval · hard negatives · synthetic data · LLM prompting · BEIR benchmark · negative mining

The pith

Generating synthetic hard negatives with LLMs for dense retrieval training underperforms traditional corpus-based mining with BM25 and cross-encoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether prompting large language models to create hard negative examples directly from a query and positive passage can replace the standard practice of mining negatives from a full document collection. It fine-tunes a DistilBERT retriever on these synthetic negatives across ten BEIR datasets and compares the results against models trained with BM25 or cross-encoder mined negatives. The experiments show that the LLM-generated negatives produce weaker retrieval models in every case examined. They also reveal that increasing the size of the generator model from 4B to 30B parameters does not steadily improve downstream retrieval scores, with a 14B model sometimes performing best and the 30B model sometimes worst.

Core claim

Training effective dense retrieval models typically relies on hard negative examples mined from large document corpora using methods such as BM25 or cross-encoders, which require full corpus access and expensive index construction. We propose generating synthetic hard negatives directly from a provided query and positive passage, using Large Language Models. We fine-tune DistilBERT using synthetic negatives generated by four state-of-the-art LLMs ranging from 4B to 30B parameters and evaluate performance across 10 BEIR benchmark datasets. Contrary to the prevailing assumption that stronger generative models yield better synthetic data, the generative pipeline consistently underperforms traditional corpus-based mining strategies (BM25 and cross-encoder).

What carries the argument

The generative pipeline that prompts an LLM to produce synthetic hard negatives given only a query and a positive passage, then uses those negatives to train a dense retriever without ever accessing the full corpus.
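The corpus-free pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, `build_prompt`, `make_triples`, and `stub_generator` are all hypothetical names introduced here, and the generator is a stand-in for whatever LLM call is actually used.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt wording; the paper's real templates are not reproduced here.
PROMPT_TEMPLATE = (
    "Query: {query}\n"
    "Relevant passage: {positive}\n"
    "Write one passage that looks relevant to the query but does not answer it."
)


def build_prompt(query: str, positive: str) -> str:
    """Fill the template with a query and its positive passage."""
    return PROMPT_TEMPLATE.format(query=query, positive=positive)


def make_triples(
    pairs: List[Tuple[str, str]],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Build (query, positive, negative) training triples without touching a corpus."""
    triples = []
    for query, positive in pairs:
        negative = generate(build_prompt(query, positive)).strip()
        # Skip unusable generations: empty text or a verbatim copy of the positive.
        if not negative or negative == positive.strip():
            continue
        triples.append({"query": query, "positive": positive, "negative": negative})
    return triples


def stub_generator(prompt: str) -> str:
    """Stand-in for an LLM call; always returns the same text."""
    return "A passage about a related but different topic."


pairs = [("what is bm25", "BM25 is a ranking function used by search engines.")]
triples = make_triples(pairs, stub_generator)
```

Passing the generator as a callable keeps the sketch independent of any particular inference library; the resulting triples would then feed whatever contrastive fine-tuning setup is used for the retriever.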

If this is right

  • Dense retrieval training still benefits from direct access to the full document collection for negative mining rather than relying solely on generative synthesis.
  • Increasing the parameter count of the negative generator does not reliably produce harder or more useful training examples for the retriever.
  • Corpus-based mining strategies remain the stronger default for creating effective training data on standard retrieval benchmarks.
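For contrast, corpus-based mining with BM25 might look like the sketch below. It assumes whitespace tokenization, a common Okapi BM25 variant with `k1=1.5` and `b=0.75`, and a toy in-memory corpus; none of these choices come from the paper, and `mine_hard_negatives` is a hypothetical helper name.

```python
import math
from collections import Counter
from typing import List, Set


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with a standard Okapi BM25 variant."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: count each term once per doc
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def mine_hard_negatives(query: str, docs: List[str], positive_ids: Set[int], top_k: int = 2) -> List[int]:
    """Return indices of the highest-scoring documents that are not labeled positive."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i not in positive_ids and scores[i] > 0][:top_k]


corpus = [
    "bm25 is a ranking function used by search engines",  # index 0: the positive
    "search engines rank pages with many signals",
    "dense retrieval encodes text into vectors",
    "a ranking function orders results for a query",
]
negatives = mine_hard_negatives("bm25 ranking function", corpus, positive_ids={0})
```

The key difference from the generative route is visible in the signature: mining needs the whole `docs` collection, which is exactly the corpus access the generative approach tries to avoid.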

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid approaches that combine limited corpus mining with LLM generation might close the observed performance gap without requiring full index construction.
  • The non-monotonic scaling pattern suggests that prompt quality or output calibration may matter more than raw generator size for this use case.
  • Future benchmarks could explicitly measure how well synthetic negatives approximate the distribution of corpus-mined negatives.

Load-bearing premise

The synthetic hard negatives generated by the LLMs from a provided query and positive passage are of comparable quality and hardness to negatives mined from the full corpus.

What would settle it

A direct experiment in which the same DistilBERT model trained on the LLM-generated negatives reaches or exceeds the nDCG@10 scores obtained by the BM25-mined baseline on a majority of the BEIR datasets.
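As a reference point for the metric named above, here is a small sketch of nDCG@10 using linear gains and a log2 rank discount. It is illustrative only; reported BEIR numbers should come from the benchmark's own evaluation tooling, and the document IDs below are made up.

```python
import math
from typing import Dict, List


def dcg_at_k(relevances: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query: DCG of the system ranking divided by the ideal DCG."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: two relevant documents, one of them ranked third.
qrels = {"d1": 1, "d2": 1}
score = ndcg_at_k(["d1", "d3", "d2"], qrels)
perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)
```

Averaging this per-query value over a dataset gives the dataset-level score that the comparison between synthetic and mined negatives would rest on.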

Figures

Figures reproduced from arXiv: 2504.21015 by Aarush Sinha.

Figure 1: Performance analysis of DistilBERT fine-tuned on synthetic hard negatives.

original abstract

Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora using methods such as BM25 or cross-encoders, which require full corpus access and expensive index construction. We propose generating synthetic hard negatives directly from a provided query and positive passage, using Large Language Models (LLMs). We fine-tune DistilBERT using synthetic negatives generated by four state-of-the-art LLMs ranging from 4B to 30B parameters (Qwen3, LLaMA3, Phi4) and evaluate performance across 10 BEIR benchmark datasets. Contrary to the prevailing assumption that stronger generative models yield better synthetic data, find that our generative pipeline consistently underperforms traditional corpus-based mining strategies (BM25 and Cross-Encoder). Furthermore, we observe that scaling the generator model does not monotonically improve retrieval performance and find that the 14B parameter model outperforms the 30B model and in some settings it is the worst performing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes generating synthetic hard negatives for training dense retrievers directly from a query and positive passage using LLMs (Qwen3, LLaMA3, and Phi4, ranging from 4B to 30B parameters) instead of mining them from a corpus with BM25 or cross-encoders. The authors fine-tune DistilBERT on the resulting data and evaluate on 10 BEIR datasets, reporting consistent underperformance relative to traditional mining and non-monotonic scaling, with the 14B model sometimes outperforming the 30B model.

Significance. If the central empirical comparison holds after addressing controls for negative hardness and experimental details, the work would be significant for information retrieval by challenging the assumption that stronger LLMs automatically produce superior synthetic training data and by underscoring the role of corpus-based hardness in effective dense retrieval training. The multi-dataset evaluation on BEIR provides a useful empirical baseline for future generative approaches.

major comments (2)
  1. [Method] Method section (generative pipeline description): The synthetic negatives are produced solely from the query and positive passage via LLM prompting, without corpus access or explicit ranking to select top distractors. This setup does not replicate the hardness of BM25 or cross-encoder mined negatives (which are the highest-ranked false positives from the full collection), so the reported underperformance on BEIR may reflect easier training signals rather than a fundamental limitation of generation; this assumption is load-bearing for the abstract's central claim.
  2. [Experiments] Experiments and Results sections: The non-monotonic scaling observation (14B outperforming 30B in some settings) and overall underperformance lack reported details on prompting templates per model, number of generations per query-positive pair, variance across runs, or statistical testing. Without these, it is unclear whether the findings are robust or influenced by generation stochasticity or prompt adherence differences.
minor comments (2)
  1. [Abstract] Abstract: The sentence 'find that our generative pipeline consistently underperforms' is missing the subject 'we'.
  2. [Results] Consider adding an analysis or table quantifying negative hardness (e.g., via embedding similarity to positives or retrieval rank in a held-out index) to support the comparison.
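The hardness analysis suggested in the second minor comment could start from something like the sketch below, which scores each negative by its cosine similarity to the positive. The bag-of-words `embed` function is a deliberately crude stand-in for a real sentence encoder, and all names and example texts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real analysis would use a trained encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_hardness(triples: List[Dict[str, str]]) -> float:
    """Average positive-negative similarity; higher suggests harder negatives."""
    sims = [cosine(embed(t["positive"]), embed(t["negative"])) for t in triples]
    return sum(sims) / len(sims)


mined = [{"positive": "bm25 is a ranking function", "negative": "a ranking function orders results"}]
synthetic = [{"positive": "bm25 is a ranking function", "negative": "cats sleep most of the day"}]
mined_hardness = mean_hardness(mined)
synthetic_hardness = mean_hardness(synthetic)
```

Comparing the two averages (or their full distributions) across negative sources would give a first, rough indication of whether generated negatives are systematically easier than mined ones.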

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below with clarifications on our design choices and commitments to strengthen the manuscript.

point-by-point responses
  1. Referee: [Method] Method section (generative pipeline description): The synthetic negatives are produced solely from the query and positive passage via LLM prompting, without corpus access or explicit ranking to select top distractors. This setup does not replicate the hardness of BM25 or cross-encoder mined negatives (which are the highest-ranked false positives from the full collection), so the reported underperformance on BEIR may reflect easier training signals rather than a fundamental limitation of generation; this assumption is load-bearing for the abstract's central claim.

    Authors: We agree that our approach generates negatives without corpus access or ranking, which differs from BM25 or cross-encoder mining. This is an intentional design to evaluate a corpus-free alternative that avoids index construction. The consistent underperformance relative to mined negatives demonstrates the difficulty of eliciting sufficiently hard negatives via prompting alone, supporting rather than undermining the central claim. We will revise the abstract and method section to explicitly frame the work as assessing corpus-free generation and to discuss the implications of this hardness gap. revision: partial

  2. Referee: [Experiments] Experiments and Results sections: The non-monotonic scaling observation (14B outperforming 30B in some settings) and overall underperformance lack reported details on prompting templates per model, number of generations per query-positive pair, variance across runs, or statistical testing. Without these, it is unclear whether the findings are robust or influenced by generation stochasticity or prompt adherence differences.

    Authors: We will add the requested details in the revised manuscript. This includes the exact prompting templates for each model (Qwen3, LLaMA3, Phi4), confirmation that one negative was generated per query-positive pair, standard deviations across multiple runs with different seeds, and statistical significance tests (e.g., paired t-tests) for key performance differences. These additions will address concerns about stochasticity and robustness. revision: yes
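A paired t-test of the kind mentioned in this response can be sketched as below. The function computes only the t statistic and degrees of freedom over per-dataset score pairs; turning that into a p-value needs the t distribution (for example via `scipy.stats.ttest_rel`). The score lists are placeholder values for illustration, not results from the paper.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def paired_t_statistic(scores_a: List[float], scores_b: List[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-dataset scores (NOT from the paper), one entry per dataset.
baseline_scores = [3.0, 5.0, 8.0]
synthetic_scores = [1.0, 2.0, 6.0]
t_stat, dof = paired_t_statistic(baseline_scores, synthetic_scores)
```

Pairing by dataset matters here because BEIR datasets differ widely in difficulty; an unpaired test would mix that between-dataset variance into the comparison.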

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external benchmarks

full rationale

The paper's central claims rest on a direct experimental pipeline: LLM-prompted generation of synthetic hard negatives from query+positive pairs, followed by fine-tuning DistilBERT and measuring nDCG@10 on the external BEIR benchmark suite. Performance is compared head-to-head against BM25 and cross-encoder negatives mined from the full corpus. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the reported method or results. All outcomes are falsifiable measurements against independent test collections rather than reductions to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on standard assumptions about benchmark validity and fair comparison of training signals but introduces no new free parameters, axioms beyond domain standards, or invented entities.

axioms (1)
  • Domain assumption: BEIR datasets provide a representative and standard evaluation for dense retrieval performance across domains.
    The performance claims rest on these benchmarks being appropriate proxies for real-world retrieval tasks.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 7 internal anchors

  1. [1]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    URL https://arxiv.org/abs/1611.09268. Steven Bird and Edward Loper. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pp. 214–217, Barcelona, Spain, July

  2. [2]

    Language Models are Few-Shot Learners

    URL https://arxiv.org/abs/2005.14165. Haonan Chen, Zhicheng Dou, Kelong Mao, Jiongnan Liu, and Ziliang Zhao. Generalizing conversational dense retrieval via llm-cognition data augmentation,

  3. [3]

    https://arxiv.org/abs/2402.07092

    URL https://arxiv.org/abs/2402.07092. Yujuan Ding, Wenqi Fan, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on rag meets llms: Towards retrieval-augmented large language models,

  4. [4]

    The Llama 3 Herd of Models

    URL https://arxiv.org/abs/2407.21783. Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. Efficient natural language response suggestion for smart reply,

  5. [5]

    Efficient Natural Language Response Suggestion for Smart Reply

    URL https://arxiv.org/abs/1705.00652. Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. InstructoR: Instructing unsupervised conversational dense retrieval with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 6649–6675, Singapore, December

  6. [6]

    doi: 10.18653/v1/2023.findings-emnlp.443

    Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.443. URL https://aclanthology.org/2023.findings-emnlp.443/. Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, an...

  7. [7]

    Dense Passage Retrieval for Open-Domain Question Answering

    Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.550. URL https://aclanthology.org/2020.emnlp-main.550/. Koray Kavukcuoglu. Gemini 2.5: Our newest gemini model with thinking, 3

  8. [8]

    [Online; accessed 2025-04-15]

    URL https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking. [Online; accessed 2025-04-15]. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InP...

  9. [9]

    Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin

    URL https://arxiv.org/abs/2408.10613. Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. Zero-shot listwise document reranking with a large language model,

  10. [10]

    Yao Meng, Chenyan Xiong, Zhenghao Liu, Zhiyuan Liu, and Jiawei Han

    URL https://arxiv.org/abs/2305.02156. Yao Meng, Chenyan Xiong, Zhenghao Liu, Zhiyuan Liu, and Jiawei Han. Augtriever: Unsupervised dense retrieval by scalable data augmentation,

  11. [11]

    Passage Re-ranking with BERT

    URL https://arxiv.org/abs/1901.04085. Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. Large language models are effective text rankers with pairwise ranking prompting. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Findings of the Associa...

  12. [12]

    doi: 10.18653/v1/2024.findings-naacl.97

    Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-naacl.97. URL https://aclanthology.org/2024.findings-naacl.97/. Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. I...

  13. [13]

    doi: 10.18653/v1/2021.naacl-main.466

    Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.466. URL https://aclanthology.org/2021.naacl-main.466/. Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natur...

  14. [14]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Association for Computational Linguistics. doi: 10.18653/v1/D19-1410. URL https://aclanthology.org/D19-1410/. Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,

  15. [15]

    URL https://arxiv.org/abs/1910.01108. Aarush Sinha, Pavan Kumar S, Roshan Balaji, and Nirav Pravinbhai Bhatt. Bica: Effective biomedical dense retrieval with citation-aware hard negatives,

  16. [16]

    Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, and Jimmy Lin

    URL https://arxiv.org/abs/2511.08029. Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, and Jimmy Lin. DRAMA: Diverse augmentation from large language models to smaller dense retrievers,

  17. [17]

    BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

    URL https://arxiv.org/abs/2104.08663. Haonan Wang, Zhiyuan Huang, Yifan Gao, Yifan Deng, Can Ma, and Jianfeng Gao. SyNeg: Synthesizing hard negatives from large language models for dense retrieval,

  18. [19]

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk

    URL https://arxiv.org/abs/2112.07577. Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval,

  19. [20]

    URL https://arxiv.org/abs/2007.00808. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang L...

  20. [21]

    Qwen3 Technical Report

    URL https://arxiv.org/abs/2505.09388. Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. Optimizing dense retrieval model training with hard negatives. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, pp. 1503–1512, New York, NY, USA,

  21. [22]

    ISBN 9781450380379

    Association for Computing Machinery. ISBN 9781450380379. doi: 10.1145/3404835.3462880. URL https://doi.org/10.1145/3404835.3462880. Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zheng Liu, Zhicheng Dou, and Ji-Rong Wen. Large language models for information retrieval: A survey,

  22. [23]

    A. LLM CONFIGURATION AND INFERENCE

    The model was loaded using vllm (Kwon et al., 2023) for efficient inference, and all models were configured as follows:

      • Sampling parameters:
        – Temperature: 0.6
        – Top-p: 0.95
        – Top-k: 20
        – Minimum p: 0.0
        – Maximum tokens: 1024
      • Tensor parallel size: 2 & 6 (for the Qwen3-30B model)
      • dtype: float32
      • GPU memory utilization: 0.80

    B. PROMPTS

    B.0.1 USER PROM...