pith. machine review for the scientific record.

arxiv: 2604.04734 · v2 · submitted 2026-04-06 · 💻 cs.IR

Recognition: 2 theorem links · Lean Theorem

Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:03 UTC · model grok-4.3

classification 💻 cs.IR
keywords knowledge distillation · dense retrieval · score distribution · stratified sampling · hard negatives · cross-encoder teacher · negative sampling

The pith

Stratified sampling of teacher scores preserves full preference structure in distillation for dense retrieval

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that knowledge distillation for dense retrieval models has overemphasized mining hard negatives at the expense of the teacher's overall score distribution. When students see only the hardest examples, they fail to internalize the broader set of relative preferences the teacher encodes across the entire score spectrum. The authors therefore introduce stratified sampling, which partitions the score range into strata and draws examples uniformly from each, keeping the original variance and entropy intact. Experiments on in-domain and out-of-domain benchmarks show this simple change yields consistent gains over both top-K hard-negative selection and random sampling.

Core claim

The central claim is that the essence of distillation lies in preserving the diverse range of relative scores perceived by the teacher. Stratified Sampling achieves this by uniformly covering the entire teacher score spectrum, thereby maintaining the statistical properties that reflect the teacher's comprehensive preference structure and producing stronger generalization than methods focused solely on hard negatives.

What carries the argument

Stratified Sampling strategy, which divides the teacher score spectrum into strata and samples uniformly across them to replicate the original distribution's variance and entropy.
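
To make the machinery concrete, here is a minimal sketch of that kind of sampler, assuming equal-width strata over the observed score range and a fixed per-stratum budget; the paper's exact binning rule and allocation scheme may differ.

```python
# Minimal sketch of stratified sampling over teacher scores (illustrative only).
# Assumptions not taken from the paper: equal-width strata, an equal per-stratum
# budget, and silent skipping of empty strata.
import numpy as np

def stratified_sample(teacher_scores, k, n_strata=10, rng=None):
    """Return indices of up to k candidates spread uniformly across score strata."""
    rng = rng or np.random.default_rng(0)
    edges = np.linspace(teacher_scores.min(), teacher_scores.max(), n_strata + 1)
    # Assign each candidate to a stratum via the interior bin boundaries.
    strata = np.digitize(teacher_scores, edges[1:-1])
    per_stratum = max(1, k // n_strata)
    chosen = []
    for b in range(n_strata):
        members = np.flatnonzero(strata == b)
        if members.size:
            take = min(per_stratum, members.size)
            chosen.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(chosen)[:k]

# Example: 1,000 candidate passages for one query, scored by a cross-encoder teacher.
scores = np.random.default_rng(1).normal(size=1000)   # stand-in teacher scores
idx = stratified_sample(scores, k=100)
```

Compared with taking only the top-K scores, this keeps low- and mid-scoring candidates in the training mix, which is the property the paper credits for preserving the teacher's preference structure.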

If this is right

  • Stratified Sampling serves as a robust baseline that significantly outperforms top-K and random sampling in diverse in-domain and out-of-domain settings.
  • Maintaining teacher score variance and entropy improves the student's ability to generalize ranking preferences beyond the training domain.
  • The focus in distillation should shift from exclusive hard-negative mining toward emulating the full spectrum of relative scores assigned by the teacher.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar distribution-preserving sampling could be tested in other teacher-student ranking or preference-learning tasks outside retrieval.
  • Stratified sampling might be combined with existing hard-negative techniques to achieve further additive gains.
  • Monitoring score-distribution statistics during training could serve as a diagnostic for whether distillation is capturing the teacher's full preference information.

Load-bearing premise

That preserving the statistical properties of the teacher score distribution through stratified sampling is sufficient to transfer the comprehensive preference structure without introducing new biases or requiring other training changes.
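
This premise is checkable in practice. A hypothetical diagnostic (not proposed in the paper, but in the spirit of the editorial extension above) is to compare the variance and histogram entropy of the scores each sampler keeps against the full teacher distribution:

```python
# Hypothetical diagnostic: how much of the teacher score distribution survives
# each sampling strategy? The synthetic scores and the 10-strata setup are
# illustrative assumptions, not values from the paper.
import numpy as np

def hist_entropy(x, n_bins=20):
    """Shannon entropy (nats) of a histogram over x."""
    counts, _ = np.histogram(x, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)          # stand-in for one query's teacher scores
k = 100

# Stratified subset over 10 equal-width bins, equal budget per nonempty bin.
edges = np.linspace(scores.min(), scores.max(), 11)
strata = np.digitize(scores, edges[1:-1])
stratified_idx = np.concatenate([
    rng.choice(np.flatnonzero(strata == b),
               size=min(k // 10, int((strata == b).sum())), replace=False)
    for b in range(10) if (strata == b).any()
])

subsets = {
    "full": scores,
    "top-k": np.sort(scores)[-k:],
    "random": rng.choice(scores, size=k, replace=False),
    "stratified": scores[stratified_idx],
}
for name, s in subsets.items():
    print(f"{name:10s} var={s.var():.3f}  entropy={hist_entropy(s):.3f}")
```

A top-K subset typically collapses both statistics; how closely the random and stratified subsets track the full row depends on the shape of the teacher distribution, which is exactly what such a diagnostic is meant to surface without retraining anything.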

What would settle it

An experiment in which a model trained with stratified sampling shows no improvement over, or performs worse than, a top-K hard-negative baseline on standard retrieval benchmarks such as MS MARCO or BEIR would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.04734 by Heuiseok Lim, Hyeonseok Moon, Seongtae Hong, Youngjoon Jang.

Figure 1. Illustration of candidate selection across different […]
Figure 2. Retrieval performance (nDCG@10) on TREC DL 19
original abstract

Transferring knowledge from a cross-encoder teacher via Knowledge Distillation (KD) has become a standard paradigm for training retrieval models. While existing studies have largely focused on mining hard negatives to improve discrimination, the systematic composition of training data and the resulting teacher score distribution have received relatively less attention. In this work, we highlight that focusing solely on hard negatives prevents the student from learning the comprehensive preference structure of the teacher, potentially hampering generalization. To effectively emulate the teacher score distribution, we propose a Stratified Sampling strategy that uniformly covers the entire score spectrum. Experiments on in-domain and out-of-domain benchmarks confirm that Stratified Sampling, which preserves the variance and entropy of teacher scores, serves as a robust baseline, significantly outperforming top-K and random sampling in diverse settings. These findings suggest that the essence of distillation lies in preserving the diverse range of relative scores perceived by the teacher.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that knowledge distillation for dense retrieval has overemphasized hard-negative mining at the expense of the teacher's full score distribution; it proposes stratified sampling to uniformly cover the teacher score spectrum (preserving variance and entropy) and reports that this simple baseline significantly outperforms both top-K and random sampling on in-domain and out-of-domain benchmarks.

Significance. If the central empirical claim holds, the work usefully redirects attention from hard-negative selection alone to the broader problem of emulating the teacher's preference structure via score distribution. The proposal of a lightweight sampling method that requires no other training changes and the confirmatory experiments across multiple benchmarks constitute a clear, falsifiable contribution that could serve as a new standard baseline in the KD-for-retrieval literature.

major comments (2)
  1. [Section 3.2] Section 3.2 (Stratified Sampling): the method requires explicit choices for the number of strata and the binning procedure (equal-width vs. quantile, boundary definitions). These hyperparameters are absent from the top-K and random baselines yet are not accompanied by a sensitivity analysis or default parameter-free rule; without such evidence it is unclear whether reported gains arise from distribution preservation or from the particular strata configuration.
  2. [Section 4] Section 4 (Experiments): the manuscript provides no details on the exact number of strata, bin-boundary selection, total negative count per query relative to baselines, or statistical significance testing of the reported improvements. These omissions are load-bearing because the central claim is that stratified sampling is a robust, superior baseline; the current evidence remains preliminary.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly name the concrete benchmarks and report the magnitude of the observed gains (e.g., nDCG@10 deltas) rather than stating only that the method 'significantly outperforms'.
  2. [Section 3] Notation for teacher scores and strata boundaries should be introduced once with a clear equation or pseudocode block to avoid ambiguity when the method is compared to top-K sampling.
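
One way the requested notation could look (an illustrative sketch, not the paper's own definitions), with f_T the cross-encoder teacher and B the number of strata:

```latex
% Illustrative notation only; the paper may define these objects differently.
s_i = f_T(q, d_i), \qquad
e_b = s_{\min} + \tfrac{b}{B}\,(s_{\max} - s_{\min}), \quad b = 0, \dots, B,
\qquad
\mathcal{S}_b = \{\, i : e_{b-1} \le s_i < e_b \,\}, \quad b = 1, \dots, B.
```

The stratified candidate set then draws an equal number of indices from each nonempty stratum, whereas top-K sampling keeps only the indices of the K largest scores s_i.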

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional clarity and analysis would strengthen the paper. We agree that the description of stratified sampling and the experimental reporting require more detail. We will revise the manuscript accordingly to address both major comments.

point-by-point responses
  1. Referee: [Section 3.2] Section 3.2 (Stratified Sampling): the method requires explicit choices for the number of strata and the binning procedure (equal-width vs. quantile, boundary definitions). These hyperparameters are absent from the top-K and random baselines yet are not accompanied by a sensitivity analysis or default parameter-free rule; without such evidence it is unclear whether reported gains arise from distribution preservation or from the particular strata configuration.

    Authors: We agree that Section 3.2 would be improved by explicitly stating the hyperparameter choices and by including a sensitivity analysis. The current manuscript does not provide these. In the revision we will specify the default configuration used (10 equal-width strata over the teacher score range) and add a sensitivity study varying the number of strata (5–20) and comparing equal-width versus quantile binning. We will also state a simple default rule (strata count = min(10, negatives/10)) so that the method is reproducible without arbitrary tuning. This will allow readers to verify that gains are driven by distribution preservation rather than a narrow hyperparameter choice. revision: yes

  2. Referee: [Section 4] Section 4 (Experiments): the manuscript provides no details on the exact number of strata, bin-boundary selection, total negative count per query relative to baselines, or statistical significance testing of the reported improvements. These omissions are load-bearing because the central claim is that stratified sampling is a robust, superior baseline; the current evidence remains preliminary.

    Authors: We acknowledge that Section 4 currently lacks these implementation and statistical details. In the revised manuscript we will report: (i) number of strata = 10, (ii) bin boundaries obtained by equal-width partitioning of the observed teacher score range, (iii) identical total negative count per query (100) for stratified, top-K, and random sampling to ensure fair comparison, and (iv) statistical significance via paired t-tests across five random seeds, with p-values reported for all main results. These additions will make the experimental evidence more complete and directly support the claim that stratified sampling is a robust baseline. revision: yes
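
A compact sketch of the configuration these responses commit to, assuming the defaults stated above (10 strata, equal-width versus quantile boundaries, paired t-tests across five seeds); all numeric values below are placeholders, not results from the paper.

```python
# Sketch of the promised revision details: two binning rules for strata
# boundaries, plus a paired significance test over per-seed metric values.
# Strata count, seed count, and metric values are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_rel

def bin_edges(scores, n_strata=10, scheme="equal_width"):
    """Strata boundaries over teacher scores under the two binning rules discussed."""
    if scheme == "equal_width":
        # Evenly spaced boundaries over the observed score range.
        return np.linspace(scores.min(), scores.max(), n_strata + 1)
    if scheme == "quantile":
        # Boundaries at empirical quantiles, so every stratum holds roughly
        # the same number of candidates even when scores bunch up.
        return np.quantile(scores, np.linspace(0.0, 1.0, n_strata + 1))
    raise ValueError(f"unknown scheme: {scheme}")

scores = np.random.default_rng(0).normal(size=1000)   # stand-in teacher scores
print(bin_edges(scores, scheme="equal_width"))
print(bin_edges(scores, scheme="quantile"))

# Paired t-test over per-seed nDCG@10 (placeholder values for illustration only).
ndcg_stratified = np.array([0.712, 0.705, 0.718, 0.709, 0.714])
ndcg_topk       = np.array([0.698, 0.701, 0.695, 0.703, 0.699])
result = ttest_rel(ndcg_stratified, ndcg_topk)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```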

Circularity Check

0 steps flagged

No circularity: empirical proposal evaluated on external benchmarks

full rationale

The paper proposes Stratified Sampling to preserve teacher score variance and entropy in knowledge distillation for dense retrieval, then evaluates the method empirically against top-K and random sampling on in-domain and out-of-domain benchmarks. No equations, derivations, or fitted parameters are present that reduce any result to its own inputs by construction. The central claim rests on experimental comparisons rather than self-referential definitions or load-bearing self-citations, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper operates within the standard knowledge distillation framework for retrieval and introduces no new free parameters, axioms, or invented entities beyond the proposed sampling procedure itself.

pith-pipeline@v0.9.0 · 5463 in / 1095 out tokens · 58399 ms · 2026-05-10T20:03:32.877604+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
