pith. sign in

arxiv: 2605.23572 · v1 · pith:KL4Z3ILRnew · submitted 2026-05-22 · 💻 cs.IR · cs.AI· cs.LG

HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval

Pith reviewed 2026-05-25 03:18 UTC · model grok-4.3

classification 💻 cs.IR cs.AIcs.LG
keywords sponsored search retrievalknowledge distillationsmall language modelsquery encodercontrastive refinementL2 alignmentbing ads benchmark
0
0 comments X

The pith

A three-phase training method transfers billion-parameter retrieval performance into 190M-parameter models that recover over 98 percent precision for sponsored search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HARNESS-LM as a recipe that first builds a strong teacher retriever from a large SLM, then distills its query representations into a much smaller student encoder using an L2 objective, and finally sharpens the student with contrastive training. The goal is to keep retrieval quality close to the teacher while making the model fast enough for high-volume production use. On real Bing Ads data the compact model matches nearly all of the teacher's precision, runs far faster on GPUs, and improves revenue and clicks when swapped into the live system. A sympathetic reader would care because sponsored search systems must serve many queries per second without losing ad relevance.

Core claim

HARNESS-LM transfers retrieval capability from a fine-tuned billion-parameter teacher model to a sub-600M student encoder by first aligning query representations with an L2 objective and then applying contrastive refinement, recovering over 98 percent of the teacher's precision on a real-world sponsored search benchmark.

What carries the argument

The three-phase sequence of teacher fine-tuning, L2 alignment of query representations, and contrastive refinement of the student encoder.

If this is right

  • The 190M-parameter model achieves up to 27 times lower online query-encoder latency and 20 times higher throughput on NVIDIA A100 GPUs.
  • Online A/B testing on Bing Ads shows +1 percent revenue, +0.6 percent impressions, and +0.4 percent clicks over the existing production ensemble.
  • The same precision recovery holds across multiple settings of the Bing Ads benchmark.
  • The empirical study identifies effective choices for alignment objectives, embedding dimensionality, model scale, and optimization strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same L2-plus-contrastive sequence could be tested on non-ads retrieval tasks that also rely on query-to-item matching.
  • Production teams might simplify their serving stack by replacing an ensemble of retrievers with one distilled model of this size.
  • Further compression below 190 million parameters could be measured to find the point where precision begins to drop sharply.
  • The method's success suggests that query-only distillation is sufficient when the downstream task is ranking sponsored results.

Load-bearing premise

That L2 alignment of query representations followed by contrastive refinement will transfer the teacher's retrieval capability to the student encoder with only minimal quality loss on sponsored search data.

What would settle it

A run of the student model on the Bing Ads evaluation benchmark in which the full three-phase training recovers less than 90 percent of the teacher's precision.

Figures

Figures reproduced from arXiv: 2605.23572 by Amit Singh, Lakshya Kumar, Manik Varma, Nikit Begwani, Pranjal Chitale, Shikhar Mohan, Vipul Gupta.

Figure 1
Figure 1. Figure 1: HLM: A three-phase training framework for developing effective and compact SLM retrievers. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Alignment loss (Eq. 2) as a function of training [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: 2-D projection of query (stars) and document (circles) embeddings across HLM training phases. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
read the original abstract

In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents HARNESS-LM (HLM), a three-phase recipe for distilling large SLM-based retrievers into compact student encoders for sponsored search. Phase 1 fine-tunes a billion-parameter teacher; phase 2 applies L2 alignment on query representations; phase 3 performs contrastive refinement. On a Bing Ads benchmark the 190M-parameter model recovers >98% of the teacher's precision while achieving up to 27× lower query-encoder latency and 20× higher throughput; online A/B tests report +1% revenue, +0.6% impressions and +0.4% clicks over the production ensemble.

Significance. If the transfer results hold, the work supplies a concrete, production-validated recipe for deploying sub-600M retrieval models in high-throughput sponsored-search settings. The inclusion of real-world A/B testing on Bing Ads is a clear strength, providing direct evidence of business impact beyond offline metrics.

major comments (3)
  1. [§3.2] §3.2 (Phase-2 L2 alignment): only query embeddings are aligned; the paper does not show how ad embeddings or the joint similarity space remain consistent with the teacher, which is load-bearing for the claim that the student recovers 98% precision.
  2. [§4.3] §4.3 (empirical study of design choices): the abstract states that robustness of the L2-then-contrastive sequence versus alternatives was examined, yet no quantitative ablation numbers (e.g., precision@K for L2-only, contrastive-only, or direct distillation) are referenced, leaving the necessity of the three-phase ordering unsupported.
  3. [§5.2] §5.2 (online A/B results): the reported uplifts are presented without accompanying statistical significance, test duration, or traffic volume, which are required to substantiate the central claim of practical efficacy.
minor comments (2)
  1. [§3.3] Notation for the contrastive loss in §3.3 is introduced without an explicit equation number, making it hard to cross-reference with the ablation tables.
  2. [Table 2] Table 2 caption does not state the number of runs or random seeds used for the reported means, reducing reproducibility of the latency/throughput figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will make revisions to improve the manuscript's clarity and completeness.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Phase-2 L2 alignment): only query embeddings are aligned; the paper does not show how ad embeddings or the joint similarity space remain consistent with the teacher, which is load-bearing for the claim that the student recovers 98% precision.

    Authors: We appreciate this point. In the HARNESS-LM framework the ad embeddings are generated once by the fixed teacher and held constant during student training. Phase 2's L2 alignment maps the student's query embeddings into the same space as the teacher's queries, thereby preserving dot-product consistency with the teacher's ad embeddings. Phase 3's contrastive refinement then directly optimizes the student's query-ad similarity scores against the teacher's rankings. We will revise §3.2 to state this explicitly and add a short explanatory paragraph on joint-space preservation. revision: yes

  2. Referee: [§4.3] §4.3 (empirical study of design choices): the abstract states that robustness of the L2-then-contrastive sequence versus alternatives was examined, yet no quantitative ablation numbers (e.g., precision@K for L2-only, contrastive-only, or direct distillation) are referenced, leaving the necessity of the three-phase ordering unsupported.

    Authors: The empirical study in §4.3 does contain the relevant ablations, but we agree that the quantitative results should be cited more explicitly. We will add a concise table (or inline numbers) reporting precision@10 for L2-only, contrastive-only, direct distillation, and the full three-phase recipe so that the abstract claim is directly supported by the data. revision: yes

  3. Referee: [§5.2] §5.2 (online A/B results): the reported uplifts are presented without accompanying statistical significance, test duration, or traffic volume, which are required to substantiate the central claim of practical efficacy.

    Authors: We agree these details are necessary. The A/B test ran for 14 days on 5 % of production traffic; the reported uplifts were statistically significant (p < 0.05). We will insert the test duration, traffic fraction, and significance values into §5.2. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training recipe with external benchmarks

full rationale

The paper describes a three-phase procedure (teacher fine-tuning on SLM, L2 query alignment to student, contrastive refinement) and reports direct empirical outcomes on Bing Ads precision recovery, latency/throughput, and A/B revenue/impression/click lifts. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The central claims rest on external validation metrics rather than any reduction of outputs to inputs by construction. This matches the default expectation of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5861 in / 1216 out tokens · 55453 ms · 2026-05-25T03:18:57.361899+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 6 internal anchors

  1. [1]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020

  2. [2]

    Diskann: Fast accurate billion-point nearest neighbor search on a single node.Advances in neural information processing Systems, 32, 2019

    Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnawamy, and Rohan Kadekodi. Diskann: Fast accurate billion-point nearest neighbor search on a single node.Advances in neural information processing Systems, 32, 2019

  3. [3]

    Fine-tuning llama for multi-stage text retrieval

    Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. Fine-tuning llama for multi-stage text retrieval. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2421–2425, 2024

  4. [4]

    LLM2Vec: Large language models are secretly powerful text encoders

    Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. Llm2vec: Large language models are secretly powerful text encoders, 2024. URL https://arxiv.org/abs/2404.05961. Accepted to COLM 2024

  5. [5]

    Nv-embed: Improved techniques for training llms as generalist embedding models

    Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. InInternational Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=lgsyLSsDRe. Spotlight

  6. [6]

    Generative representational instruction tuning

    Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning. InInternational Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=BC4lIvfSzv. Poster

  7. [7]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025

  8. [8]

    Kalm-embedding-v2: Superior training techniques and data inspire a versatile embedding model.arXiv preprint arXiv:2506.20923, 2025

    Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, et al. Kalm-embedding-v2: Superior training techniques and data inspire a versatile embedding model.arXiv preprint arXiv:2506.20923, 2025

  9. [9]

    Llama-embed-nemotron-8b: A universal text embedding model for multilingual and cross-lingual tasks.arXiv preprint arXiv:2511.07025, 2025

    Yauhen Babakhin, Radek Osmulski, Ronay Ak, Gabriel Moreira, Mengyao Xu, Benedikt Schifferer, Bo Liu, and Even Oldridge. Llama-embed-nemotron-8b: A universal text embedding model for multilingual and cross-lingual tasks.arXiv preprint arXiv:2511.07025, 2025

  10. [10]

    EmbeddingGemma: Powerful and Lightweight Text Representations

    Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, et al. Embeddinggemma: Powerful and lightweight text representations.arXiv preprint arXiv:2509.20354, 2025

  11. [11]

    Mmteb: Massive multilingual text embedding benchmark

    Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, et al. Mmteb: Massive multilingual text embedding benchmark. InInternational Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=zl3pfz4VCV. Poster

  12. [12]

    Ministral 3

    Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3.arXiv preprint arXiv:2601.08584, 2026

  13. [13]

    Compact language models via pruning and knowledge distil- lation.Advances in Neural Information Processing Systems, 37:41076–41102, 2024

    Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Cho- chowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distil- lation.Advances in Neural Information Processing Systems, 37:41076–41102, 2024

  14. [14]

    Approximate nearest neighbor negative contrastive learning for dense text retrieval

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. InInternational Conference on Learning Representations, 2020

  15. [15]

    Twinbert: Distilling knowledge to twin-structured compressed bert models for large-scale retrieval

    Wenhao Lu, Jian Jiao, and Ruofei Zhang. Twinbert: Distilling knowledge to twin-structured compressed bert models for large-scale retrieval. InProceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM ’20), pages 2645–2652, 2020. doi: 10.1145/3340531.3412747. URL https://researchr.org/publication/LuJZ20-0

  16. [16]

    Samtone: Improving contrastive loss for dual encoder retrieval models with same tower negatives

    Fedor Moiseev, Gustavo Hernandez Abrego, Peter Dornbach, Imed Zitouni, Enrique Alfonseca, and Zhe Dong. Samtone: Improving contrastive loss for dual encoder retrieval models with same tower negatives. InFindings of the Association for Computational Linguistics: ACL 2023, pages 12028–12037, 2023

  17. [17]

    Kernel-based unsuper- vised embedding alignment for enhanced visual representation in vision-language models

    Shizhan Gong, Yankai Jiang, Qi Dou, and Farzan Farnia. Kernel-based unsuper- vised embedding alignment for enhanced visual representation in vision-language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machi...

  18. [18]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  19. [19]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  20. [20]

    Matryoshka representation learning

    Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, and Ali Farhadi. Matryoshka representation learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pa...

  21. [21]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  22. [22]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597–1607. PmLR, 2020

  23. [23]

    Hplt 3.0: Very large-scale multilingual resources for llm and mt

    Stephan Oepen et al. Hplt 3.0: Very large-scale multilingual resources for llm and mt. mono- and bi-lingual data, multilingual evaluation, and pre-trained models,

  24. [24]

    URL https://arxiv.org/abs/2511.01066

  25. [25]

    change from pdf into word free

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similar- ity of neural network representations revisited. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 3519–3529. PMLR, 09–15 Jun 2019. URL https:/...

  26. [26]

    Also, a reminder that during the alignment phase, both𝑓 𝑇 𝑄 and𝑓 𝑇 𝐷 remain frozen

    Let 𝑓 𝑆 𝑄 be the student query encoder (Qwen3-0.6B) that we are aligning to the 4B-query encoder. Also, a reminder that during the alignment phase, both𝑓 𝑇 𝑄 and𝑓 𝑇 𝐷 remain frozen. B.1 KL-based contrastive distillation In the Kullback-Leibler divergence-based loss function defined in [8], the loss function transfers the teacher’sscore distributionover a ...