pith. machine review for the scientific record.

arxiv: 2604.04734 · v2 · submitted 2026-04-06 · 💻 cs.IR

Recognition: 2 theorem links · Lean Theorem

Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:03 UTC · model grok-4.3

classification 💻 cs.IR
keywords knowledge distillation · dense retrieval · score distribution · stratified sampling · hard negatives · cross-encoder teacher · negative sampling

The pith

Stratified sampling of teacher scores preserves full preference structure in distillation for dense retrieval

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that knowledge distillation for dense retrieval models has overemphasized mining hard negatives at the expense of the teacher's overall score distribution. When students see only the hardest examples, they fail to internalize the broader set of relative preferences the teacher encodes across the entire score spectrum. The authors therefore introduce stratified sampling, which partitions the score range into strata and draws examples uniformly from each, keeping the original variance and entropy intact. Experiments on in-domain and out-of-domain benchmarks show this simple change yields consistent gains over both top-K hard-negative selection and random sampling.

Core claim

The central claim is that the essence of distillation lies in preserving the diverse range of relative scores perceived by the teacher. Stratified Sampling achieves this by uniformly covering the entire teacher score spectrum, thereby maintaining the statistical properties that reflect the teacher's comprehensive preference structure and producing stronger generalization than methods focused solely on hard negatives.

What carries the argument

Stratified Sampling strategy, which divides the teacher score spectrum into strata and samples uniformly across them to replicate the original distribution's variance and entropy.
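
To make the machinery concrete, here is a minimal sketch of that kind of sampler, assuming equal-width strata over the observed score range and a fixed per-stratum budget; the paper's exact binning rule and allocation scheme may differ.

```python
# Minimal sketch of stratified sampling over teacher scores (illustrative only).
# Assumptions not taken from the paper: equal-width strata, an equal per-stratum
# budget, and silent skipping of empty strata.
import numpy as np

def stratified_sample(teacher_scores, k, n_strata=10, rng=None):
    """Return indices of up to k candidates spread uniformly across score strata."""
    rng = rng or np.random.default_rng(0)
    edges = np.linspace(teacher_scores.min(), teacher_scores.max(), n_strata + 1)
    # Assign each candidate to a stratum via the interior bin boundaries.
    strata = np.digitize(teacher_scores, edges[1:-1])
    per_stratum = max(1, k // n_strata)
    chosen = []
    for b in range(n_strata):
        members = np.flatnonzero(strata == b)
        if members.size:
            take = min(per_stratum, members.size)
            chosen.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(chosen)[:k]

# Example: 1,000 candidate passages for one query, scored by a cross-encoder teacher.
scores = np.random.default_rng(1).normal(size=1000)   # stand-in teacher scores
idx = stratified_sample(scores, k=100)
```

Compared with taking only the top-K scores, this keeps low- and mid-scoring candidates in the training mix, which is the property the paper credits for preserving the teacher's preference structure.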

If this is right

  • Stratified Sampling serves as a robust baseline that significantly outperforms top-K and random sampling in diverse in-domain and out-of-domain settings.
  • Maintaining teacher score variance and entropy improves the student's ability to generalize ranking preferences beyond the training domain.
  • The focus in distillation should shift from exclusive hard-negative mining toward emulating the full spectrum of relative scores assigned by the teacher.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar distribution-preserving sampling could be tested in other teacher-student ranking or preference-learning tasks outside retrieval.
  • Stratified sampling might be combined with existing hard-negative techniques to achieve further additive gains.
  • Monitoring score-distribution statistics during training could serve as a diagnostic for whether distillation is capturing the teacher's full preference information.

Load-bearing premise

That preserving the statistical properties of the teacher score distribution through stratified sampling is sufficient to transfer the comprehensive preference structure without introducing new biases or requiring other training changes.
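
This premise is checkable in practice. A hypothetical diagnostic (not proposed in the paper, but in the spirit of the editorial extension above) is to compare the variance and histogram entropy of the scores each sampler keeps against the full teacher distribution:

```python
# Hypothetical diagnostic: how much of the teacher score distribution survives
# each sampling strategy? The synthetic scores and the 10-strata setup are
# illustrative assumptions, not values from the paper.
import numpy as np

def hist_entropy(x, n_bins=20):
    """Shannon entropy (nats) of a histogram over x."""
    counts, _ = np.histogram(x, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)          # stand-in for one query's teacher scores
k = 100

# Stratified subset over 10 equal-width bins, equal budget per nonempty bin.
edges = np.linspace(scores.min(), scores.max(), 11)
strata = np.digitize(scores, edges[1:-1])
stratified_idx = np.concatenate([
    rng.choice(np.flatnonzero(strata == b),
               size=min(k // 10, int((strata == b).sum())), replace=False)
    for b in range(10) if (strata == b).any()
])

subsets = {
    "full": scores,
    "top-k": np.sort(scores)[-k:],
    "random": rng.choice(scores, size=k, replace=False),
    "stratified": scores[stratified_idx],
}
for name, s in subsets.items():
    print(f"{name:10s} var={s.var():.3f}  entropy={hist_entropy(s):.3f}")
```

A top-K subset typically collapses both statistics; how closely the random and stratified subsets track the full row depends on the shape of the teacher distribution, which is exactly what such a diagnostic is meant to surface without retraining anything.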

What would settle it

An experiment in which a model trained with stratified sampling shows no improvement over, or performs worse than, a top-K hard-negative baseline on standard retrieval benchmarks such as MS MARCO or BEIR would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.04734 by Heuiseok Lim, Hyeonseok Moon, Seongtae Hong, Youngjoon Jang.

Figure 1. Illustration of candidate selection across different […]
Figure 2. Retrieval performance (nDCG@10) on TREC DL 19
original abstract

Transferring knowledge from a cross-encoder teacher via Knowledge Distillation (KD) has become a standard paradigm for training retrieval models. While existing studies have largely focused on mining hard negatives to improve discrimination, the systematic composition of training data and the resulting teacher score distribution have received relatively less attention. In this work, we highlight that focusing solely on hard negatives prevents the student from learning the comprehensive preference structure of the teacher, potentially hampering generalization. To effectively emulate the teacher score distribution, we propose a Stratified Sampling strategy that uniformly covers the entire score spectrum. Experiments on in-domain and out-of-domain benchmarks confirm that Stratified Sampling, which preserves the variance and entropy of teacher scores, serves as a robust baseline, significantly outperforming top-K and random sampling in diverse settings. These findings suggest that the essence of distillation lies in preserving the diverse range of relative scores perceived by the teacher.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that knowledge distillation for dense retrieval has overemphasized hard-negative mining at the expense of the teacher's full score distribution; it proposes stratified sampling to uniformly cover the teacher score spectrum (preserving variance and entropy) and reports that this simple baseline significantly outperforms both top-K and random sampling on in-domain and out-of-domain benchmarks.

Significance. If the central empirical claim holds, the work usefully redirects attention from hard-negative selection alone to the broader problem of emulating the teacher's preference structure via score distribution. The proposal of a lightweight sampling method that requires no other training changes and the confirmatory experiments across multiple benchmarks constitute a clear, falsifiable contribution that could serve as a new standard baseline in the KD-for-retrieval literature.

major comments (2)
  1. [Section 3.2] Section 3.2 (Stratified Sampling): the method requires explicit choices for the number of strata and the binning procedure (equal-width vs. quantile, boundary definitions). These hyperparameters are absent from the top-K and random baselines yet are not accompanied by a sensitivity analysis or default parameter-free rule; without such evidence it is unclear whether reported gains arise from distribution preservation or from the particular strata configuration.
  2. [Section 4] Section 4 (Experiments): the manuscript provides no details on the exact number of strata, bin-boundary selection, total negative count per query relative to baselines, or statistical significance testing of the reported improvements. These omissions are load-bearing because the central claim is that stratified sampling is a robust, superior baseline; the current evidence remains preliminary.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly name the concrete benchmarks and report the magnitude of the observed gains (e.g., nDCG@10 deltas) rather than stating only that the method 'significantly outperforms'.
  2. [Section 3] Notation for teacher scores and strata boundaries should be introduced once with a clear equation or pseudocode block to avoid ambiguity when the method is compared to top-K sampling.
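
One way the requested notation could look (an illustrative sketch, not the paper's own definitions), with f_T the cross-encoder teacher and B the number of strata:

```latex
% Illustrative notation only; the paper may define these objects differently.
s_i = f_T(q, d_i), \qquad
e_b = s_{\min} + \tfrac{b}{B}\,(s_{\max} - s_{\min}), \quad b = 0, \dots, B,
\qquad
\mathcal{S}_b = \{\, i : e_{b-1} \le s_i < e_b \,\}, \quad b = 1, \dots, B.
```

The stratified candidate set then draws an equal number of indices from each nonempty stratum, whereas top-K sampling keeps only the indices of the K largest scores s_i.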

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional clarity and analysis would strengthen the paper. We agree that the description of stratified sampling and the experimental reporting require more detail. We will revise the manuscript accordingly to address both major comments.

point-by-point responses
  1. Referee: [Section 3.2] Section 3.2 (Stratified Sampling): the method requires explicit choices for the number of strata and the binning procedure (equal-width vs. quantile, boundary definitions). These hyperparameters are absent from the top-K and random baselines yet are not accompanied by a sensitivity analysis or default parameter-free rule; without such evidence it is unclear whether reported gains arise from distribution preservation or from the particular strata configuration.

    Authors: We agree that Section 3.2 would be improved by explicitly stating the hyperparameter choices and by including a sensitivity analysis. The current manuscript does not provide these. In the revision we will specify the default configuration used (10 equal-width strata over the teacher score range) and add a sensitivity study varying the number of strata (5–20) and comparing equal-width versus quantile binning. We will also state a simple default rule (strata count = min(10, negatives/10)) so that the method is reproducible without arbitrary tuning. This will allow readers to verify that gains are driven by distribution preservation rather than a narrow hyperparameter choice. revision: yes

  2. Referee: [Section 4] Section 4 (Experiments): the manuscript provides no details on the exact number of strata, bin-boundary selection, total negative count per query relative to baselines, or statistical significance testing of the reported improvements. These omissions are load-bearing because the central claim is that stratified sampling is a robust, superior baseline; the current evidence remains preliminary.

    Authors: We acknowledge that Section 4 currently lacks these implementation and statistical details. In the revised manuscript we will report: (i) number of strata = 10, (ii) bin boundaries obtained by equal-width partitioning of the observed teacher score range, (iii) identical total negative count per query (100) for stratified, top-K, and random sampling to ensure fair comparison, and (iv) statistical significance via paired t-tests across five random seeds, with p-values reported for all main results. These additions will make the experimental evidence more complete and directly support the claim that stratified sampling is a robust baseline. revision: yes
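
A compact sketch of the configuration these responses commit to, assuming the defaults stated above (10 strata, equal-width versus quantile boundaries, paired t-tests across five seeds); all numeric values below are placeholders, not results from the paper.

```python
# Sketch of the promised revision details: two binning rules for strata
# boundaries, plus a paired significance test over per-seed metric values.
# Strata count, seed count, and metric values are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_rel

def bin_edges(scores, n_strata=10, scheme="equal_width"):
    """Strata boundaries over teacher scores under the two binning rules discussed."""
    if scheme == "equal_width":
        # Evenly spaced boundaries over the observed score range.
        return np.linspace(scores.min(), scores.max(), n_strata + 1)
    if scheme == "quantile":
        # Boundaries at empirical quantiles, so every stratum holds roughly
        # the same number of candidates even when scores bunch up.
        return np.quantile(scores, np.linspace(0.0, 1.0, n_strata + 1))
    raise ValueError(f"unknown scheme: {scheme}")

scores = np.random.default_rng(0).normal(size=1000)   # stand-in teacher scores
print(bin_edges(scores, scheme="equal_width"))
print(bin_edges(scores, scheme="quantile"))

# Paired t-test over per-seed nDCG@10 (placeholder values for illustration only).
ndcg_stratified = np.array([0.712, 0.705, 0.718, 0.709, 0.714])
ndcg_topk       = np.array([0.698, 0.701, 0.695, 0.703, 0.699])
result = ttest_rel(ndcg_stratified, ndcg_topk)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```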

Circularity Check

0 steps flagged

No circularity: empirical proposal evaluated on external benchmarks

full rationale

The paper proposes Stratified Sampling to preserve teacher score variance and entropy in knowledge distillation for dense retrieval, then evaluates the method empirically against top-K and random sampling on in-domain and out-of-domain benchmarks. No equations, derivations, or fitted parameters are present that reduce any result to its own inputs by construction. The central claim rests on experimental comparisons rather than self-referential definitions or load-bearing self-citations, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper operates within the standard knowledge distillation framework for retrieval and introduces no new free parameters, axioms, or invented entities beyond the proposed sampling procedure itself.

pith-pipeline@v0.9.0 · 5463 in / 1095 out tokens · 58399 ms · 2026-05-10T20:03:32.877604+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
