pith. sign in

arxiv: 2509.12539 · v2 · submitted 2025-09-16 · 💻 cs.IR · cs.CL· cs.LG

LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations

Pith reviewed 2026-05-18 17:08 UTC · model grok-4.3

classification 💻 cs.IR cs.CLcs.LG
keywords knowledge distillationtext embeddingsinformation retrievalasymmetric retrievalmodel compressionBEIR benchmarkMTEBembedding alignment
0
0 comments X

The pith

Distilling text embeddings while aligning them to the teacher enables small models to pair with large ones for better retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that distilling text embedding models while aligning them to their teachers produces compact models suitable for flexible use in retrieval. A sympathetic reader would care because this reduces the cost of running high-quality search systems by letting a big model handle the one-time document encoding and a small model handle the frequent queries. The work shows this alignment also passes on advanced features like quantization tolerance without extra training effort. It demonstrates the result with a new 23M-parameter model that tops the BEIR ranking for its size and improves further when paired asymmetrically with its teacher.

Core claim

LEAF is a distillation framework for text embedding models that creates student models whose representations are aligned with those of the teacher model. This alignment permits asymmetric information retrieval architectures in which the teacher encodes documents and the student encodes queries. The resulting leaf-ir model, with 23 million parameters, establishes a new state of the art on the BEIR benchmark among models of similar size, with performance increasing when operated in the asymmetric mode. The same framework applied to multi-task learning yields leaf-mt, which leads the MTEB leaderboard in its size category.

What carries the argument

Teacher-aligned representations produced by the LEAF distillation process, which maintain compatibility between teacher and student embeddings to enable mixed-scale retrieval without performance loss.

Load-bearing premise

The premise that aligning a small model to a large teacher's outputs will let the small model work well for queries while the large one handles documents.

What would settle it

A direct comparison on the BEIR benchmark showing that the asymmetric setup with the teacher encoding documents and the leaf model encoding queries yields no improvement or lower scores than using the leaf model alone.

Figures

Figures reproduced from arXiv: 2509.12539 by Robin Vujanic, Thomas Rueckstiess.

Figure 1
Figure 1. Figure 1: Our information retrieval oriented leaf-ir model (23M parameters) sets a new state-of-the-art on the BEIR benchmark for ≤100M parameters models. When run in asymmetric mode, its retrieval performance is further increased. Our multi-task leaf-mt model (23M parameters) also sets a new state-of-the-art on MTEB v2 (English) for models of its size. Hatched columns indicate comparison models that leverage larger… view at source ↗
Figure 2
Figure 2. Figure 2: LEAF distillations enable flexible asymmetric archi￾tectures where documents are embedded with large, compu￾tationally expensive models while queries can be encoded with substantially smaller leaf models. Unfortunately, this dramatic performance jump has come at the cost of an explosion in the size [17] of these models, resulting in massive costs in terms of hardware, electric power, and mainte￾nance requi… view at source ↗
Figure 3
Figure 3. Figure 3: leaf model architecture. Let 𝐼 be an index over the training set, and {(𝑡𝑖 ,𝑦ˆ𝑖)}𝑖∈𝐼 be the collec￾tion of tuples of training texts 𝑡𝑖 (strings) and their corresponding ground truth embeddings 𝑦ˆ𝑖 ∈ R 𝑑 produced by the teacher model. The texts𝑡𝑖 can entail both queries as well as documents. Let 𝑥𝑖 ∈ N 𝑇 be the tokenization of 𝑡𝑖 , where 𝑇 is the sequence length. Padding tokens [PAD] are appended to make th… view at source ↗
Figure 4
Figure 4. Figure 4: Model performance when trained for one epoch [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 2
Figure 2. Figure 2: In this mode, the retrieval performance of our system is [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 5
Figure 5. Figure 5: Performance curves for various MRL dimension truncations, as well as output quantization regimes. Benchmarked [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Performance curves across various embedding dimensions with a teacher model trained with MRL [ [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: NanoBEIR performance of various leaf-ir training checkpoints. the model’s internals (keys, queries and values). It also requires the teacher and student models to adopt the same tokenizer7 and for the two models to have the same number of attention heads. The approach in [10] finds that although different models assign different relevant scores to the same pairs of query and document, the difference in sco… view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of learning rate decay schedules. Linear [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 8
Figure 8. Figure 8: Pooling layer comparison. Mean pooling achieves [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of the validation losses observed over [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: NanoBEIR performance of various checkpoints [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: leaf-ir wall clock inference times on an i3.large instance for queries and documents, as a function of batch size. leaf models allow for larger batch sizes while remaining under the 0.1 seconds threshhold for queries. 0 5 10 15 20 25 Batch Size 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 Wall Clock Time (s) Queries 0.1 sec leaf-ir arctic-embed-m-v1.5 0 5 10 15 20 25 Batch Size 0 2 4 6 8 10 Documents leaf-ir a… view at source ↗
read the original abstract

We present LEAF ("Lightweight Embedding Alignment Framework"), a knowledge distillation framework for text embedding models. A key distinguishing feature is that our distilled leaf models are aligned to their teacher. In the context of information retrieval, this allows for flexible asymmetric architectures where documents are encoded with the larger teacher model, while queries can be served with the smaller leaf models. We also show that leaf models automatically inherit MRL and robustness to output quantization whenever these properties are present in the teacher model, without explicitly training for them. To demonstrate the capability of our framework we publish leaf-ir, a 23M parameters information retrieval oriented text embedding model trained using LEAF, which sets a new state-of-the-art (SOTA) on BEIR, ranking #1 on the public leaderboard for this benchmark and for models of its size. When run in asymmetric mode, its retrieval performance is further increased. Our scheme is however not restricted to the information retrieval setting, and we demonstrate its wider applicability by synthesizing the multi-task leaf-mt model. This also sets a new SOTA, ranking #1 on the public MTEB v2 (English) leaderboard for its size. LEAF is applicable to black-box models and in contrast to other embedding model training frameworks, it does not require judgments nor hard negatives, and training can be conducted using small batch sizes. Thus, dataset and training infrastructure requirements for our framework are modest. We make our models publicly available under a permissive Apache 2.0 license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LEAF, a knowledge distillation framework for text embedding models in which student ('leaf') models are aligned to teacher representations. This alignment is presented as enabling flexible asymmetric retrieval architectures, with documents encoded by the larger teacher and queries served by the smaller leaf model. The authors release leaf-ir (23M parameters), claiming new state-of-the-art results on the BEIR benchmark (ranking #1 for its size class) with further gains when run asymmetrically; a multi-task variant leaf-mt is also claimed to set a new SOTA on MTEB v2 English for its size. The framework is described as requiring neither judgments nor hard negatives, operating with small batch sizes, and automatically inheriting properties such as Matryoshka Representation Learning and quantization robustness from the teacher when present.

Significance. If the empirical claims are substantiated with detailed experiments, LEAF could represent a practical advance for efficient deployment of embedding models in information retrieval by lowering data and compute requirements while supporting asymmetric teacher-student combinations. The automatic inheritance of advanced properties without explicit training would be a notable strength for real-world use cases.

major comments (2)
  1. [Abstract] Abstract: The claim that asymmetric mode (teacher-encoded documents scored against leaf-encoded queries) further increases retrieval performance without task-specific fine-tuning or additional negative mining is load-bearing for the central advantage of 'flexible asymmetric architectures'. This rests on the unverified assumption that a standard representation-matching distillation objective applied to the same inputs will preserve ranking quality under cosine or dot-product scoring, despite potential stylistic and length mismatches between queries and documents; no analysis or experiments addressing this transfer are provided.
  2. [Abstract] Abstract: The SOTA claims for leaf-ir on BEIR and leaf-mt on MTEB v2 rest on external public leaderboards rather than quantities defined by fitted parameters or internal controls inside the paper. No ablation studies, statistical significance tests, or experimental details (e.g., exact distillation loss weights, temperature, batch size schedule, or teacher model) are supplied, preventing assessment of whether the 23M-parameter results are robust or confounded by external factors.
minor comments (2)
  1. The abstract and method description would benefit from explicit notation for the distillation objective and how teacher alignment is enforced (e.g., loss form and any hyperparameters).
  2. Consider adding a table comparing leaf-ir against other models of similar parameter count on BEIR subsets to make the leaderboard claim more self-contained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and indicate where we will revise the paper to strengthen the presentation of our results and claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that asymmetric mode (teacher-encoded documents scored against leaf-encoded queries) further increases retrieval performance without task-specific fine-tuning or additional negative mining is load-bearing for the central advantage of 'flexible asymmetric architectures'. This rests on the unverified assumption that a standard representation-matching distillation objective applied to the same inputs will preserve ranking quality under cosine or dot-product scoring, despite potential stylistic and length mismatches between queries and documents; no analysis or experiments addressing this transfer are provided.

    Authors: We thank the referee for this important point. The asymmetric mode is indeed central to the practical value of teacher-aligned representations. Our manuscript reports empirical improvements when using the leaf model for queries against teacher-encoded documents on BEIR. To directly address concerns about stylistic and length mismatches and the preservation of ranking quality, we will add a dedicated analysis subsection in the revised manuscript. This will include quantitative comparisons of query versus document representation properties and additional experiments breaking down performance by query length and domain. We believe these additions will substantiate the transfer under the distillation objective. revision: yes

  2. Referee: [Abstract] Abstract: The SOTA claims for leaf-ir on BEIR and leaf-mt on MTEB v2 rest on external public leaderboards rather than quantities defined by fitted parameters or internal controls inside the paper. No ablation studies, statistical significance tests, or experimental details (e.g., exact distillation loss weights, temperature, batch size schedule, or teacher model) are supplied, preventing assessment of whether the 23M-parameter results are robust or confounded by external factors.

    Authors: We appreciate the referee highlighting the need for greater experimental transparency. Public leaderboards are the established evaluation standard for BEIR and MTEB to enable reproducible comparisons across models. Our current manuscript includes core training details and the general framework; however, we agree that additional internal controls would strengthen the work. In the revision we will expand the experimental section and appendix with ablation studies on key hyperparameters (including loss weights, temperature, and batch size), statistical significance testing on the reported metrics, and explicit specification of the teacher model and full training schedule. These additions will allow readers to better assess robustness. revision: yes

Circularity Check

0 steps flagged

No circularity detected in LEAF framework or empirical claims

full rationale

The paper introduces the LEAF distillation framework that aligns student embeddings to a teacher model on the same inputs, enabling asymmetric query-document scoring as an empirical consequence rather than a definitional tautology. SOTA results on BEIR and MTEB are reported via external public leaderboards using standard retrieval metrics, with no evidence that any performance quantity is fitted internally and then renamed as a prediction. No equations, self-citations, or uniqueness theorems are invoked in a load-bearing way that reduces the central claims to the paper's own inputs by construction. The method is presented as self-contained with explicit statements on modest data and infrastructure needs, making the derivation chain independent and externally falsifiable.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The framework rests on standard knowledge-distillation assumptions plus the empirical claim that alignment transfers useful properties; no new physical entities or mathematical axioms are introduced.

free parameters (2)
  • distillation loss weights and temperature
    Hyperparameters controlling the alignment objective are chosen during training but not enumerated in the abstract.
  • batch size and learning rate schedule
    Training configuration parameters required to reproduce the reported models.
axioms (1)
  • domain assumption The teacher model produces high-quality representations worth distilling
    The entire pipeline presupposes that the chosen teacher is already effective on the target tasks.

pith-pipeline@v0.9.0 · 5804 in / 1242 out tokens · 36460 ms · 2026-05-18T17:08:36.629240+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 8 internal anchors

  1. [1]

    Daniel Campos, Alessandro Magnani, and ChengXiang Zhai. 2023. Quick dense retrievers consume kale: Post training kullback leibler alignment of embeddings for asymmetrical dual encoders.arXiv preprint arXiv:2304.01016(2023)

  2. [2]

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. InPro- ceedings of the 37th International Conference on Machine Learning (ICML’20). JMLR.org, Article 149, 11 pages

  3. [3]

    Xuanang Chen, Ben He, Kai Hui, Le Sun, and Yingfei Sun. 2021. Simplified TinyBERT: Knowledge Distillation for Document Retrieval. InAdvances in In- formation Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 – April 1, 2021, Proceedings, Part II. Springer-Verlag, Berlin, Heidelberg, 241–248. doi:10.1007/978-3-030-72240-1_21

  4. [4]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186

  5. [5]

    Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao...

  6. [6]

    Hugging Face. 2023. DistilBERT. https://github.com/huggingface/transformers- research-projects/blob/362a490dc36e91359fe76a7a707dc29e663196b2/ distillation/distiller.py#L434 Accessed: 2025-05-22

  7. [7]

    2021.SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking

    Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021.SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. Association for Computing Machinery, New York, NY, USA, 2288–2292. https://doi.org/10.1145/ 3404835.3463098

  8. [8]

    Mansi Gupta, Nitish Kulkarni, Raghuveer Chanda, Anirudha Rayasam, and Zachary C Lipton. 2019. Amazonqa: A review-based question answering task. arXiv preprint arXiv:1908.04364(2019)

  9. [9]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531(2015)

  10. [10]

    Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. 2020. Improving Efficient Neural Ranking Models with Cross- Architecture Knowledge Distillation. arXiv:2010.02666 [cs.IR]

  11. [11]

    Sebastian Hofstätter, Markus Zlabinger, and Allan Hanbury. 2020. Interpretable & time-budget-constrained contextualization for re-ranking. InECAI 2020. IOS Press, 513–520

  12. [12]

    Gautier Izacard and Edouard Grave. 2021. Distilling Knowledge from Reader to Retriever for Question Answering. InInternational Conference on Learning Representations. https://openreview.net/forum?id=NTEz-6wysdb

  13. [13]

    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for Natural Language Un- derstanding. InFindings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 4163–4174. doi:10.18653...

  14. [14]

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu

  15. [15]

    In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Pro- cessing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

    PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Pro- cessing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2567–2577

  16. [16]

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. triviaqa: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehen- sion.arXiv e-prints, Article arXiv:1705.03551 (2017), arXiv:1705.03551 pages. arXiv:1705.03551

  17. [17]

    Kaggle. 2023. Dataset containing 479k English words. doi:10.34740/KAGGLE/ DS/4075094

  18. [18]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361(2020)

  19. [19]

    Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, and Ali Farhadi. 2022. Matryoshka Representation Learning. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35....

  20. [20]

    Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. 2024. SPLADE-v3: New baselines for SPLADE. arXiv:2403.06789 [cs.IR] https://arxiv. org/abs/2403.06789

  21. [21]

    Wenhao Lu, Jian Jiao, and Ruofei Zhang. 2020. Twinbert: Distilling knowledge to twin-structured compressed bert models for large-scale retrieval. InProceedings of the 29th ACM International Conference on Information & Knowledge Management. 2645–2652

  22. [22]

    Luke Merrick, Danmei Xu, Gaurav Nuti, and Daniel Campos. 2024. Arctic- embed: Scalable, efficient, and accurate text embedding models.arXiv preprint arXiv:2405.05374(2024)

  23. [23]

    Niklas Muennighoff, Hongjin SU, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. 2025. Generative Representational Instruc- tion Tuning. InThe Thirteenth International Conference on Learning Representa- tions. https://openreview.net/forum?id=BC4lIvfSzv

  24. [24]

    Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, P...

  25. [25]

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.CoRRabs/1611.09268 (2016). arXiv:1611.09268 http://arxiv.org/abs/1611.09268

  26. [26]

    1994.Usability engineering

    Jakob Nielsen. 1994.Usability engineering. Morgan Kaufmann

  27. [27]

    Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=n6SCkn2QaG

  28. [28]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. InProceedings of the 2019 Conference on Em- pirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084

  29. [29]

    Nils Reimers and Iryna Gurevych. 2020. Making Monolingual Sentence Em- beddings Multilingual using Knowledge Distillation. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computa- tional Linguistics, Online, 4512–4525. doi:10.18653/...

  30. [30]

    Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, QiaoQiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. 2021. RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau ...

  31. [31]

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108 [cs.CL] https://arxiv.org/abs/1910.01108

  32. [32]

    Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. arXiv:2112.01488 [cs.IR] https://arxiv.org/abs/2112.01488

  33. [33]

    Ferdinand Schlatt, Tim Hagen, Martin Potthast, and Matthias Hagen. 2025. TITE: Token-Independent Text Encoder for Information Retrieval. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Infor- mation Retrieval(Padua, Italy)(SIGIR ’25). Association for Computing Machinery, New York, NY, USA, 2493–2503. doi:10.1145...

  34. [34]

    Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, and Aditi Raghunathan. 2025. Repetition Improves Language Model Embeddings. In The Thirteenth International Conference on Learning Representations. https: //openreview.net/forum?id=Ahlrf2HGJR

  35. [35]

    Smith, Luke Zettlemoyer, and Tao Yu

    Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2023. One Embedder, Any Task: Instruction-Finetuned Text Embeddings. InFindings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Comput...

  36. [36]

    Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984(2020)

  37. [37]

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models.arXiv preprint arXiv:2104.08663(2021)

  38. [38]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)

  39. [39]

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou

  40. [40]

    Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers.Advances in neural information processing systems33 (2020), 5776–5788

  41. [41]

    Xhluca. 2023. BM25 Benchmarks. https://github.com/xhluca/bm25- benchmarks/blob/22df0ccccc083d2985a5b7e55d3ecd1764d4f9ed/README.md? plain=1#L220. Accessed: 2025-05-07

  42. [42]

    best marvel movie

    Puxuan Yu, Luke Merrick, Gaurav Nuti, and Daniel Campos. 2024. Arctic-Embed 2.0: Multilingual Retrieval Without Compromise. arXiv:2412.04506 [cs.CL] https: //arxiv.org/abs/2412.04506 9 Vujanic et al. A Training Setup A.1 Pooling Layer In Figure 8 a comparison of different popular pooling layers is shown. These are obtained by running one training epoch of...