HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval

Amit Singh; Lakshya Kumar; Manik Varma; Nikit Begwani; Pranjal Chitale; Shikhar Mohan; Vipul Gupta

arXiv: 2605.23572 · v1 · pith:KL4Z3ILRnew · submitted 2026-05-22 · 💻 cs.IR · cs.AI · cs.LG


Pith reviewed 2026-05-25 03:18 UTC · model grok-4.3

keywords: sponsored search retrieval · knowledge distillation · small language models · query encoder · contrastive refinement · L2 alignment · Bing Ads benchmark

The pith

A three-phase training method transfers the retrieval performance of a billion-parameter teacher into 190M-parameter models that recover over 98 percent of the teacher's precision for sponsored search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HARNESS-LM as a recipe that first fine-tunes a billion-parameter-scale SLM into a strong teacher retriever, then distills its query representations into a much smaller student encoder using an L2 objective, and finally sharpens the student with contrastive training. The goal is to keep retrieval quality close to the teacher while making the model fast enough for high-volume production use. On real Bing Ads data the compact model recovers over 98 percent of the teacher's precision, cuts online query-encoder latency by up to 27 times on NVIDIA A100 GPUs, and improves revenue, impressions, and clicks when swapped into the live system. A sympathetic reader would care because sponsored search systems must serve many queries per second without losing ad relevance.

Core claim

HARNESS-LM transfers retrieval capability from a fine-tuned billion-parameter teacher model to a sub-600M student encoder by first aligning query representations with an L2 objective and then applying contrastive refinement, recovering over 98 percent of the teacher's precision on a real-world sponsored search benchmark.
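The alignment step in the core claim can be illustrated with a minimal sketch. This is not the paper's Eq. 2, whose exact form is not reproduced on this page; it assumes a mean of squared L2 distances over paired query embeddings of equal dimension. The function name `l2_alignment_loss` and the toy vectors are hypothetical.

```python
def l2_alignment_loss(student_batch, teacher_batch):
    """Mean squared L2 distance between paired student and teacher query embeddings.

    Assumption: both encoders emit vectors of the same dimension, so no
    projection layer is needed. Each batch is a list of equal-length float lists.
    """
    if len(student_batch) != len(teacher_batch) or not student_batch:
        raise ValueError("batches must be non-empty and the same size")
    total = 0.0
    for s, t in zip(student_batch, teacher_batch):
        if len(s) != len(t):
            raise ValueError("paired embeddings must share a dimension")
        # Squared Euclidean distance for one (student, teacher) query pair.
        total += sum((si - ti) ** 2 for si, ti in zip(s, t))
    return total / len(student_batch)


# Toy 2-D embeddings (hypothetical values, for illustration only).
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.5, 0.0], [0.0, 1.0]]
loss = l2_alignment_loss(student, teacher)
```

In this sketch the teacher embeddings are plain inputs, which mirrors the idea of a frozen teacher: only the student side would receive gradients in a real training loop.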

What carries the argument

The three-phase sequence of teacher fine-tuning, L2 alignment of query representations, and contrastive refinement of the student encoder.
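The third phase, contrastive refinement, can be sketched as an in-batch softmax loss. The page does not state the paper's exact contrastive objective, so the InfoNCE form below, the temperature value, and the names `dot` and `info_nce_loss` are assumptions made for illustration. Document embeddings are treated as fixed inputs.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """In-batch contrastive loss: query i's positive is doc i, all other docs are negatives.

    Assumption: temperature=0.05 is a hypothetical default, not a value from the paper.
    """
    if len(query_embs) != len(doc_embs) or not query_embs:
        raise ValueError("need one positive document per query")
    total = 0.0
    for i, q in enumerate(query_embs):
        logits = [dot(q, d) / temperature for d in doc_embs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive document.
        total += log_denom - logits[i]
    return total / len(query_embs)


# Toy batch (hypothetical values): the first query set points at its own
# documents, the second points at the wrong ones.
docs = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = info_nce_loss(aligned, docs)
loss_swapped = info_nce_loss(swapped, docs)
```

Unlike a pointwise L2 term, this loss depends on how each query scores its positive relative to the other documents in the batch, which is why it can sharpen ranking behavior after alignment.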

If this is right

The 190M-parameter model achieves up to 27 times lower online query-encoder latency and 20 times higher throughput on NVIDIA A100 GPUs.
Online A/B testing on Bing Ads shows +1 percent revenue, +0.6 percent impressions, and +0.4 percent clicks over the existing production ensemble.
The same precision recovery holds across multiple settings of the Bing Ads benchmark.
The empirical study identifies effective choices for alignment objectives, embedding dimensionality, model scale, and optimization strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same L2-plus-contrastive sequence could be tested on non-ads retrieval tasks that also rely on query-to-item matching.
Production teams might simplify their serving stack by replacing an ensemble of retrievers with one distilled model of this size.
Further compression below 190 million parameters could be measured to find the point where precision begins to drop sharply.
The method's success suggests that query-only distillation may be sufficient when the downstream task is retrieving sponsored results.

Load-bearing premise

That L2 alignment of query representations followed by contrastive refinement will transfer the teacher's retrieval capability to the student encoder with only minimal quality loss on sponsored search data.

What would settle it

A run of the student model on the Bing Ads evaluation benchmark in which the full three-phase training recovers less than 90 percent of the teacher's precision.
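The settling test above reduces to a ratio check. The sketch below assumes "recovery" means student precision divided by teacher precision on the same benchmark; the function names and the example precision values are hypothetical and are not taken from the paper.

```python
def precision_recovery(student_precision, teacher_precision):
    """Fraction of the teacher's precision that the student retains."""
    if teacher_precision <= 0:
        raise ValueError("teacher precision must be positive")
    return student_precision / teacher_precision


def falsifies_claim(student_precision, teacher_precision, threshold=0.90):
    """True if recovery falls below the threshold named in the test above."""
    return precision_recovery(student_precision, teacher_precision) < threshold


# Hypothetical precision values, for illustration only.
recovery = precision_recovery(0.49, 0.50)
claim_broken = falsifies_claim(0.44, 0.50)
claim_holds = not falsifies_claim(0.49, 0.50)
```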

Figures

Figures reproduced from arXiv: 2605.23572 by Amit Singh, Lakshya Kumar, Manik Varma, Nikit Begwani, Pranjal Chitale, Shikhar Mohan, Vipul Gupta.

**Figure 2.** Alignment loss (Eq. 2) as a function of training

**Figure 3.** 2-D projection of query (stars) and document (circles) embeddings across HLM training phases.

Original abstract

In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers a production-tested distillation recipe that recovers most of a large teacher's precision in sponsored search while showing measurable A/B revenue gains.

The main things to know are that the three-phase approach (billion-scale teacher, L2 query alignment to a sub-600M student, then contrastive refinement) reaches over 98% precision recovery on their Bing Ads benchmark and that live A/B tests with the 190M model produced +1% revenue, +0.6% impressions, and +0.4% clicks over the production ensemble, alongside 27x lower query latency and 20x higher throughput on A100s. The A/B numbers come from real traffic, not just held-out sets.

The empirical study of alignment objectives, embedding dimensions, model scales, architectures, and optimization choices adds concrete data on what mattered for their setting. That combination of offline recovery and online business metrics is the part that stands out. Most distillation work stops earlier. The paper does a reasonable job showing the recipe can be made to work end-to-end in a high-volume ad retrieval system.

The soft spot is the transfer mechanism itself. Only query embeddings get the explicit L2 alignment; ad embeddings and the final similarity space are left to the contrastive stage. L2 is pointwise and does not directly push ranking or hard-negative separation, so it is not obvious that the student inherits the teacher's relative distances rather than learning a new space in phase three. The abstract claims the empirical study covers design choices, but without the specific ablation numbers on skipping L2 or swapping objectives, it is hard to judge how load-bearing that middle step really is. If the contrastive phase is doing most of the work, the claimed efficiency of the full sequence is less clear.

This is for teams running embedding-based retrieval under tight latency budgets in sponsored search or similar domains. Readers who need deployable compression recipes with production evidence will get value; people looking for novel theoretical distillation methods will find less.

The real A/B results and the scale of the deployment make it worth sending to a serious referee rather than desk-rejecting, even though the core steps are standard techniques applied to this domain.
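The letter's concern about pointwise L2 can be made concrete with a standard Cauchy-Schwarz bound. This is an editorial worked step, not a result from the paper. It assumes the student query embedding is scored against the teacher's ad embeddings with a dot product; the symbols $s$, $t$, $d$, and $\varepsilon$ are introduced here for illustration, with $t = f^T_Q(q)$ and $s = f^S_Q(q)$ following the encoder notation quoted elsewhere on this page.

```latex
% Score drift for one ad embedding d when the student query s replaces the teacher query t:
\left| \langle s, d \rangle - \langle t, d \rangle \right|
  = \left| \langle s - t, d \rangle \right|
  \le \lVert s - t \rVert_2 \, \lVert d \rVert_2 .

% Hence, if \lVert s - t \rVert_2 \le \varepsilon and \lVert d \rVert_2 \le 1 for all ads,
% every score moves by at most \varepsilon, and the order of two ads d_1, d_2 is preserved whenever
\langle t, d_1 \rangle - \langle t, d_2 \rangle > 2\varepsilon .
```

Under those assumptions, a small alignment error protects only the pairs the teacher separates by a wide margin. Closely scored ads, which are typically the hard negatives, get no such guarantee, which is consistent with the letter's point that the ranking behavior may depend on the contrastive stage.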

Referee Report

3 major / 2 minor

Summary. The paper presents HARNESS-LM (HLM), a three-phase recipe for distilling large SLM-based retrievers into compact student encoders for sponsored search. Phase 1 fine-tunes a billion-parameter teacher; phase 2 applies L2 alignment on query representations; phase 3 performs contrastive refinement. On a Bing Ads benchmark the 190M-parameter model recovers >98% of the teacher's precision while achieving up to 27× lower query-encoder latency and 20× higher throughput; online A/B tests report +1% revenue, +0.6% impressions and +0.4% clicks over the production ensemble.

Significance. If the transfer results hold, the work supplies a concrete, production-validated recipe for deploying sub-600M retrieval models in high-throughput sponsored-search settings. The inclusion of real-world A/B testing on Bing Ads is a clear strength, providing direct evidence of business impact beyond offline metrics.

major comments (3)

[§3.2] Phase-2 L2 alignment: only query embeddings are aligned; the paper does not show how ad embeddings or the joint similarity space remain consistent with the teacher, which is load-bearing for the claim that the student recovers 98% precision.
[§4.3] Empirical study of design choices: the abstract promises a study of alignment objectives and optimization strategies, yet no quantitative ablation numbers (e.g., precision@K for L2-only, contrastive-only, or direct distillation) are referenced, leaving the necessity of the three-phase ordering unsupported.
[§5.2] Online A/B results: the reported uplifts are presented without accompanying statistical significance, test duration, or traffic volume, which are required to substantiate the central claim of practical efficacy.
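The ablation numbers requested in the major comments would typically be reported as precision@K. The sketch below shows one common definition, the share of the top K retrieved ads that are labeled relevant. Whether the Bing Ads benchmark uses exactly this definition is not stated on this page, and the function name and ad identifiers are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k retrieved items that are relevant.

    Assumption: the denominator is always k, so a retriever that returns
    fewer than k items is penalized for the missing slots.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for item_id in top_k if item_id in relevant_ids)
    return hits / k


# Hypothetical retrieval result for one query.
ranked = ["ad3", "ad1", "ad7", "ad2"]
relevant = {"ad1", "ad2"}
p_at_2 = precision_at_k(ranked, relevant, 2)
p_at_4 = precision_at_k(ranked, relevant, 4)
```

Averaging this value over all benchmark queries for each training variant (L2-only, contrastive-only, full recipe) would give the comparison the referee asks for.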

minor comments (2)

[§3.3] Notation for the contrastive loss in §3.3 is introduced without an explicit equation number, making it hard to cross-reference with the ablation tables.
[Table 2] The caption does not state the number of runs or random seeds used for the reported means, reducing reproducibility of the latency/throughput figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will make revisions to improve the manuscript's clarity and completeness.

Referee: [§3.2] Phase-2 L2 alignment: only query embeddings are aligned; the paper does not show how ad embeddings or the joint similarity space remain consistent with the teacher, which is load-bearing for the claim that the student recovers 98% precision.

Authors: We appreciate this point. In the HARNESS-LM framework the ad embeddings are generated once by the fixed teacher and held constant during student training. Phase 2's L2 alignment maps the student's query embeddings into the same space as the teacher's queries, thereby preserving dot-product consistency with the teacher's ad embeddings. Phase 3's contrastive refinement then directly optimizes the student's query-ad similarity scores against the teacher's rankings. We will revise §3.2 to state this explicitly and add a short explanatory paragraph on joint-space preservation.
revision: yes

Referee: [§4.3] Empirical study of design choices: the abstract promises a study of alignment objectives and optimization strategies, yet no quantitative ablation numbers (e.g., precision@K for L2-only, contrastive-only, or direct distillation) are referenced, leaving the necessity of the three-phase ordering unsupported.

Authors: The empirical study in §4.3 does contain the relevant ablations, but we agree that the quantitative results should be cited more explicitly. We will add a concise table (or inline numbers) reporting precision@10 for L2-only, contrastive-only, direct distillation, and the full three-phase recipe so that the abstract claim is directly supported by the data.
revision: yes

Referee: [§5.2] Online A/B results: the reported uplifts are presented without accompanying statistical significance, test duration, or traffic volume, which are required to substantiate the central claim of practical efficacy.

Authors: We agree these details are necessary. The A/B test ran for 14 days on 5% of production traffic; the reported uplifts were statistically significant (p < 0.05). We will insert the test duration, traffic fraction, and significance values into §5.2.
revision: yes
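The exchange about significance and traffic volume can be illustrated with a two-proportion z-test. This sketch is an assumption about how such a check might be run; the paper's actual test procedure is not described on this page. It applies to rate-like metrics such as clicks per impression, not directly to revenue. All counts are hypothetical.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two observed rates.

    Returns (z, p_value), where z > 0 means arm B has the higher rate.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both arms share one rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: a +0.4% relative click uplift at two traffic volumes.
z_small, p_small = two_proportion_z_test(50_000, 1_000_000, 50_200, 1_000_000)
z_large, p_large = two_proportion_z_test(5_000_000, 100_000_000, 5_020_000, 100_000_000)
```

With these made-up counts the same relative uplift is far from significant at one million impressions per arm and clearly significant at one hundred million, which is why the referee asks for traffic volume alongside the uplift figures.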

Circularity Check

0 steps flagged

No circularity: empirical training recipe with external benchmarks

The paper describes a three-phase procedure (teacher fine-tuning on SLM, L2 query alignment to student, contrastive refinement) and reports direct empirical outcomes on Bing Ads precision recovery, latency/throughput, and A/B revenue/impression/click lifts. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The central claims rest on external validation metrics rather than any reduction of outputs to inputs by construction. This matches the default expectation of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5861 in / 1216 out tokens · 55453 ms · 2026-05-25T03:18:57.361899+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "the three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

