HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval
Pith reviewed 2026-05-25 03:18 UTC · model grok-4.3
The pith
A three-phase training method transfers the retrieval performance of a billion-parameter model into a 190M-parameter model that recovers over 98 percent of the teacher's precision on sponsored search.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HARNESS-LM transfers retrieval capability from a fine-tuned billion-parameter teacher model to a sub-600M student encoder by first aligning query representations with an L2 objective and then applying contrastive refinement, recovering over 98 percent of the teacher's precision on a real-world sponsored search benchmark.
What carries the argument
The three-phase sequence of teacher fine-tuning, L2 alignment of query representations, and contrastive refinement of the student encoder.
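As a rough sketch, the two student-side objectives might be written as follows. The symbols $f^S_Q$ and $f^T_Q$ (student and teacher query encoders), the ad embeddings $d_j$ with positive $d_i^{+}$, the temperature $\tau$, and the InfoNCE-style form of the contrastive loss are all assumptions made here for illustration; the summary above does not give the exact formulation.

```latex
% Phase 2: L2 alignment of query representations (assumed form)
\mathcal{L}_{\text{align}}
  = \frac{1}{N}\sum_{i=1}^{N}
    \left\lVert f^S_Q(q_i) - f^T_Q(q_i) \right\rVert_2^2

% Phase 3: contrastive refinement (assumed InfoNCE-style form;
% the candidate set d_1, ..., d_M includes the positive d_i^+)
\mathcal{L}_{\text{con}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log \frac{\exp\!\left(\langle f^S_Q(q_i),\, d_i^{+}\rangle / \tau\right)}
              {\sum_{j=1}^{M} \exp\!\left(\langle f^S_Q(q_i),\, d_j\rangle / \tau\right)}
```

Under this reading, phase 2 pulls each student query vector toward the teacher's, and phase 3 then sharpens the ordering of ads for each query.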
If this is right
- The 190M-parameter model achieves up to 27 times lower online query-encoder latency and 20 times higher throughput on NVIDIA A100 GPUs.
- Online A/B testing on Bing Ads shows +1 percent revenue, +0.6 percent impressions, and +0.4 percent clicks over the existing production ensemble.
- The same precision recovery holds across multiple settings of the Bing Ads benchmark.
- The empirical study identifies effective choices for alignment objectives, embedding dimensionality, model scale, and optimization strategies.
Where Pith is reading between the lines
- The same L2-plus-contrastive sequence could be tested on non-ads retrieval tasks that also rely on query-to-item matching.
- Production teams might simplify their serving stack by replacing an ensemble of retrievers with one distilled model of this size.
- Further compression below 190 million parameters could be measured to find the point where precision begins to drop sharply.
- The method's success suggests that query-only distillation is sufficient when the downstream task is ranking sponsored results.
Load-bearing premise
That L2 alignment of query representations followed by contrastive refinement will transfer the teacher's retrieval capability to the student encoder with only minimal quality loss on sponsored search data.
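One way to see why aligning only the query side might be enough is the Cauchy-Schwarz inequality: if the ad embeddings stay fixed (the simulated rebuttal further down says they come from the frozen teacher), the change in any query-ad dot product is bounded by the query alignment error times the ad norm. The sketch below checks that bound on hypothetical 3-dimensional vectors; the function names and values are illustrative and not taken from the paper.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean (L2) norm."""
    return math.sqrt(dot(u, u))

def score_gap_bound(student_q, teacher_q, ad):
    """Return (actual score gap, Cauchy-Schwarz upper bound).

    gap   = |<q_s, d> - <q_t, d>|
    bound = ||q_s - q_t|| * ||d||
    """
    diff = [s - t for s, t in zip(student_q, teacher_q)]
    gap = abs(dot(student_q, ad) - dot(teacher_q, ad))
    bound = norm(diff) * norm(ad)
    return gap, bound

# Hypothetical embeddings, for illustration only.
teacher_q = [0.6, 0.8, 0.0]
student_q = [0.58, 0.79, 0.05]
ad = [0.0, 1.0, 0.0]

gap, bound = score_gap_bound(student_q, teacher_q, ad)
print(f"gap={gap:.4f}, bound={bound:.4f}, holds={gap <= bound + 1e-12}")
```

A small L2 error therefore keeps every score close to the teacher's, although it does not by itself guarantee that the ranking of near-tied ads is preserved, which is presumably where the contrastive phase helps.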
What would settle it
A run of the student model on the Bing Ads evaluation benchmark in which the full three-phase training recovers less than 90 percent of the teacher's precision.
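The falsification test above reduces to a ratio of two precision numbers. A minimal sketch follows, assuming precision is measured identically for teacher and student; the function names, the "inconclusive" middle band, and the example values are hypothetical and not drawn from the paper.

```python
def precision_recovery(student_precision, teacher_precision):
    """Fraction of the teacher's precision retained by the student."""
    if teacher_precision <= 0:
        raise ValueError("teacher precision must be positive")
    return student_precision / teacher_precision

def verdict(recovery, claimed=0.98, refuted_below=0.90):
    """Classify a recovery ratio against the claim and the refutation bar.

    The band between the two thresholds is labelled "inconclusive";
    that label is an assumption of this sketch, not of the review.
    """
    if recovery < refuted_below:
        return "refuted"
    if recovery > claimed:
        return "consistent with claim"
    return "inconclusive"

# Hypothetical precision values, not taken from the paper.
r = precision_recovery(0.495, 0.50)
print(round(r, 3), verdict(r))
```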
Original abstract
In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents HARNESS-LM (HLM), a three-phase recipe for distilling large SLM-based retrievers into compact student encoders for sponsored search. Phase 1 fine-tunes a billion-parameter teacher; phase 2 applies L2 alignment on query representations; phase 3 performs contrastive refinement. On a Bing Ads benchmark the 190M-parameter model recovers >98% of the teacher's precision while achieving up to 27× lower query-encoder latency and 20× higher throughput; online A/B tests report +1% revenue, +0.6% impressions and +0.4% clicks over the production ensemble.
Significance. If the transfer results hold, the work supplies a concrete, production-validated recipe for deploying sub-600M retrieval models in high-throughput sponsored-search settings. The inclusion of real-world A/B testing on Bing Ads is a clear strength, providing direct evidence of business impact beyond offline metrics.
major comments (3)
- [§3.2] §3.2 (Phase-2 L2 alignment): only query embeddings are aligned; the paper does not show how ad embeddings or the joint similarity space remain consistent with the teacher, which is load-bearing for the claim that the student recovers 98% precision.
- [§4.3] §4.3 (empirical study of design choices): the abstract states that robustness of the L2-then-contrastive sequence versus alternatives was examined, yet no quantitative ablation numbers (e.g., precision@K for L2-only, contrastive-only, or direct distillation) are referenced, leaving the necessity of the three-phase ordering unsupported.
- [§5.2] §5.2 (online A/B results): the reported uplifts are presented without accompanying statistical significance, test duration, or traffic volume, which are required to substantiate the central claim of practical efficacy.
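The ablation asked for in the §4.3 comment could be reported with a metric such as precision@K per training variant. The sketch below is one common definition (hits in the top K divided by K), applied to hypothetical rankings for a single query; the variant names and ad identifiers are made up for illustration and say nothing about the paper's actual numbers.

```python
def precision_at_k(ranked_ads, relevant_ads, k):
    """Share of the top-k retrieved ads that are labelled relevant.

    Divides by k even when fewer than k ads were retrieved.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ads[:k]
    hits = sum(1 for ad in top_k if ad in relevant_ads)
    return hits / k

# Hypothetical relevance labels and rankings for one query.
relevant = {"ad1", "ad3", "ad4"}
rankings = {
    "l2_only": ["ad1", "ad2", "ad5", "ad3"],
    "contrastive_only": ["ad2", "ad1", "ad3", "ad5"],
    "full_recipe": ["ad1", "ad3", "ad4", "ad2"],
}
for name, ranked in rankings.items():
    print(name, round(precision_at_k(ranked, relevant, k=3), 3))
```

In practice the metric would be averaged over many queries before variants are compared.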
minor comments (2)
- [§3.3] Notation for the contrastive loss in §3.3 is introduced without an explicit equation number, making it hard to cross-reference with the ablation tables.
- [Table 2] Table 2 caption does not state the number of runs or random seeds used for the reported means, reducing reproducibility of the latency/throughput figures.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will make revisions to improve the manuscript's clarity and completeness.
- Referee: [§3.2] §3.2 (Phase-2 L2 alignment): only query embeddings are aligned; the paper does not show how ad embeddings or the joint similarity space remain consistent with the teacher, which is load-bearing for the claim that the student recovers 98% precision.
  Authors: We appreciate this point. In the HARNESS-LM framework, the ad embeddings are generated once by the fixed teacher and held constant during student training. Phase 2's L2 alignment maps the student's query embeddings into the same space as the teacher's query embeddings, thereby preserving dot-product consistency with the teacher's ad embeddings. Phase 3's contrastive refinement then directly optimizes the student's query-ad similarity scores against the teacher's rankings. We will revise §3.2 to state this explicitly and add a short explanatory paragraph on joint-space preservation.
  Revision: yes
- Referee: [§4.3] §4.3 (empirical study of design choices): the abstract states that robustness of the L2-then-contrastive sequence versus alternatives was examined, yet no quantitative ablation numbers (e.g., precision@K for L2-only, contrastive-only, or direct distillation) are referenced, leaving the necessity of the three-phase ordering unsupported.
  Authors: The empirical study in §4.3 does contain the relevant ablations, but we agree that the quantitative results should be cited more explicitly. We will add a concise table (or inline numbers) reporting precision@10 for L2-only, contrastive-only, direct distillation, and the full three-phase recipe so that the abstract claim is directly supported by the data.
  Revision: yes
- Referee: [§5.2] §5.2 (online A/B results): the reported uplifts are presented without accompanying statistical significance, test duration, or traffic volume, which are required to substantiate the central claim of practical efficacy.
  Authors: We agree these details are necessary. The A/B test ran for 14 days on 5% of production traffic; the reported uplifts were statistically significant (p < 0.05). We will insert the test duration, traffic fraction, and significance values into §5.2.
  Revision: yes
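A significance statement like the one in the last response can be checked for rate metrics with a two-proportion z-test. The sketch below is a standard textbook version using only the Python standard library; the click and impression counts are hypothetical, and the test applies to proportions such as click-through rate, not to revenue, which would need a different test.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value), where z > 0 means arm B has the higher rate.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical clicks / impressions for control (A) and treatment (B).
z, p = two_proportion_z_test(50_000, 1_000_000, 50_700, 1_000_000)
print(f"z={z:.2f}, p={p:.4f}, significant at 0.05: {p < 0.05}")
```

Reporting the arm sizes alongside the p-value, as the referee requests, lets a reader redo this calculation independently.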
Circularity Check
No circularity: empirical training recipe with external benchmarks
The paper describes a three-phase procedure (teacher fine-tuning on SLM, L2 query alignment to student, contrastive refinement) and reports direct empirical outcomes on Bing Ads precision recovery, latency/throughput, and A/B revenue/impression/click lifts. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The central claims rest on external validation metrics rather than any reduction of outputs to inputs by construction. This matches the default expectation of a self-contained empirical study.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "the three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.