jina-embeddings-v5-text: Task-Targeted Embedding Distillation

Han Xiao; Maximilian Werk; Michael G\"unther; Mohammad Kalim Akram; Nastia Havriushenko; Quentin Herreros; Saba Sturua

arxiv: 2602.15547 · v2 · submitted 2026-02-17 · 💻 cs.CL

jina-embeddings-v5-text: Task-Targeted Embedding Distillation

Mohammad Kalim Akram , Saba Sturua , Nastia Havriushenko , Quentin Herreros , Michael G\"unther , Maximilian Werk , Han Xiao This is my paper

Pith reviewed 2026-05-15 21:45 UTC · model grok-4.3

classification 💻 cs.CL

keywords text embeddingsmodel distillationcontrastive losssmall modelssemantic similarityinformation retrievalmodel compression

0 comments

The pith

Combining distillation with task-specific contrastive loss produces compact text embedding models that match or exceed state-of-the-art benchmarks for their size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a training approach for text embedding models that merges model distillation from a larger teacher with a contrastive loss tuned to specific tasks. This hybrid regimen aims to generate smaller models that perform better than those trained with contrastive loss or distillation in isolation. If the method works as described, it would enable high-quality embeddings on devices with limited compute while supporting long inputs and quantized outputs. The authors release two resulting models and report competitive scores on standard semantic tasks.

Core claim

The authors introduce a training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Their findings indicate this combined approach trains small models more effectively than purely contrastive or distillation-based methods alone. The resulting jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano models achieve or surpass state-of-the-art scores for comparable sizes, while handling texts up to 32k tokens across languages and remaining robust under truncation and binary quantization.

What carries the argument

The task-targeted embedding distillation regimen that pairs knowledge distillation from a teacher model with a contrastive loss customized to downstream tasks such as retrieval and classification.

If this is right

Small embedding models can reach or exceed the performance of larger ones on semantic similarity, retrieval, clustering, and classification benchmarks.
The models maintain effectiveness on long inputs up to 32k tokens in multiple languages.
Embeddings stay reliable after input truncation or conversion to binary quantized form.
Public release of the model weights allows direct use and further experimentation by others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hybrid regimen might transfer to training compact models for other modalities such as images or audio embeddings.
Further reductions in model size could remain viable if the distillation and contrastive components are tuned together.
Integration into retrieval systems could lower memory and latency costs without major accuracy loss.

Load-bearing premise

The performance gains arise specifically from combining distillation with task-specific contrastive loss rather than from unstated differences in training data selection or hyperparameter tuning.

What would settle it

An ablation experiment training identical small models with only distillation, only task-specific contrastive loss, and the full combination, then measuring whether the hybrid version alone reaches the reported benchmark levels.

Figures

Figures reproduced from arXiv: 2602.15547 by Han Xiao, Maximilian Werk, Michael G\"unther, Mohammad Kalim Akram, Nastia Havriushenko, Quentin Herreros, Saba Sturua.

**Figure 2.** Figure 2: Performance of j-v5-text-small on different languages on MMTEB compared to other models highest average scores in their size category. The Qwen3-4B model, which we used as the teacher model, still significantly outperforms our models, but it has more than five times as many parameters as jina-embeddings-v5-text-small and sixteen times as many as jina-embeddings-v5-text-nano. KaLM-mini-v2.5 achieves slight… view at source ↗

**Figure 3.** Figure 3: Performance comparison of different training [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of projection configurations on [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 5.** Figure 5: Average MMTEB score across reduced embed [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Learning rate sensitivity across different optimization objectives. We report the average nDCG@10 on the MTEB (English, v2) benchmark using the S2ORC dataset. The plots compare 1×10−4 (blue) and 1×10−5 (orange) learning rates for embedding-based distillation (Ldistill), InfoNCE (L q→d NCE ), and score-based distillation (Lscore), all utilizing a trainable student projection. • InfoNCE (L q→d NCE): In contr… view at source ↗

**Figure 7.** Figure 7: Performance of Models on different languages on MMTEB compared to average performance [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗

read the original abstract

Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or distillation-based training paradigms alone. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size. jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust under truncation and binary quantization. Model weights are publicly available, hopefully inspiring further advances in embedding model development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper mixes distillation with task-specific contrastive loss to train small embedding models and releases two that hit size-matched SOTA on benchmarks, but the causal credit for the mix over data or tuning is unproven without ablations.

read the letter

The core point is a training mix of distillation and task-targeted contrastive loss aimed at compact text embedding models. The resulting jina-embeddings-v5-text-small and nano versions are said to match or beat current leaders for their size, while adding long-context support up to 32k tokens, multilingual coverage, and stability under truncation and binary quantization. Public weights are released, which is straightforwardly useful for anyone who wants to test or extend them.

Referee Report

2 major / 2 minor

Summary. The paper introduces jina-embeddings-v5-text, a family of compact text embedding models trained via a novel regimen that combines model distillation techniques with task-specific contrastive loss. It claims this hybrid approach is more effective for small models than purely contrastive or distillation-based training alone, with the resulting jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano variants matching or exceeding state-of-the-art performance on benchmarks for their size. The models support contexts up to 32k tokens across many languages and produce embeddings robust to truncation and binary quantization; weights are released publicly.

Significance. If the performance claims are substantiated with proper controls, the work could meaningfully advance efficient embedding model development by demonstrating a practical hybrid training recipe that improves small-model regimes, with direct implications for deployment in retrieval, clustering, and classification tasks under resource constraints. The public release of weights is a clear strength enabling reproducibility and follow-on research.

major comments (2)

[Abstract and Experimental Results] The central claim that the combined distillation + task-specific contrastive regimen outperforms purely contrastive or purely distillation-based paradigms (abstract) lacks supporting evidence from controlled ablations. No results are shown for the identical small/nano architectures trained on the same data using (a) contrastive loss alone or (b) distillation alone, so attribution of gains to the combination rather than data curation or hyperparameter choices cannot be verified.
[Benchmark Results] Benchmark superiority or parity claims for jina-embeddings-v5-text-small and nano (abstract) are stated without accompanying numerical tables, exact MTEB scores, or direct head-to-head comparisons against named baselines of similar size; this prevents independent verification of the 'exceed or match' assertion.

minor comments (2)

[Abstract] The abstract states support for 'many languages' but does not enumerate the languages or report per-language or cross-lingual metrics.
[Methods] Notation for the task-specific contrastive loss should be formalized with an equation in the methods section to clarify its distinction from standard contrastive objectives.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the current manuscript requires additional evidence to support the central claims and will revise accordingly to include controlled ablations and explicit benchmark numbers.

read point-by-point responses

Referee: [Abstract and Experimental Results] The central claim that the combined distillation + task-specific contrastive regimen outperforms purely contrastive or purely distillation-based paradigms (abstract) lacks supporting evidence from controlled ablations. No results are shown for the identical small/nano architectures trained on the same data using (a) contrastive loss alone or (b) distillation alone, so attribution of gains to the combination rather than data curation or hyperparameter choices cannot be verified.

Authors: We agree that the absence of controlled ablations prevents clear attribution of gains to the hybrid regimen. In the revised manuscript we will add results for the identical small and nano architectures trained on the same data using (a) contrastive loss alone and (b) distillation alone, with all other factors held constant. These new experiments will be presented in a dedicated ablation subsection. revision: yes
Referee: [Benchmark Results] Benchmark superiority or parity claims for jina-embeddings-v5-text-small and nano (abstract) are stated without accompanying numerical tables, exact MTEB scores, or direct head-to-head comparisons against named baselines of similar size; this prevents independent verification of the 'exceed or match' assertion.

Authors: We acknowledge that the abstract currently lacks specific numerical values. The full experimental section already contains detailed MTEB tables with exact scores and comparisons to named baselines of comparable size (e.g., 22M–50M parameter models). In the revision we will insert a compact summary table of key MTEB scores and direct comparisons into the abstract to enable immediate verification. revision: yes

Circularity Check

0 steps flagged

No derivation chain or circularity present in empirical claims

full rationale

The paper describes an empirical training regimen that combines distillation with task-specific contrastive loss for small embedding models, then reports benchmark scores against external SOTA. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the abstract or described full text. Claims rest on direct model training results and external benchmark comparisons (MTEB-style), which are falsifiable outside the paper and do not reduce to self-definition or input renaming. This is a standard empirical ML contribution with no circular steps in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all claims rest on unspecified training details and benchmark evaluations.

pith-pipeline@v0.9.0 · 5480 in / 1015 out tokens · 20308 ms · 2026-05-15T21:45:23.804944+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss... Ldistill = sum of cosine distances... Lq→dNCE InfoNCE loss... LGOR global orthogonal regularizer

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Test-Time Compute for Frozen Embedding Models through Agentic Program Search
cs.LG 2026-05 unverdicted novelty 7.0

A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.
Test-Time Compute for Frozen Embedding Models through Agentic Program Search
cs.LG 2026-05 unverdicted novelty 7.0

Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
cs.CL 2026-05 unverdicted novelty 7.0

Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
cs.CL 2026-05 unverdicted novelty 7.0

LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
cs.AI 2026-05 unverdicted novelty 7.0

SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
LMEB: Long-horizon Memory Embedding Benchmark
cs.CL 2026-03 unverdicted novelty 7.0

LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
cs.CL 2026-05 accept novelty 6.0

Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.
To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Learning, Except In Heavy Truncation Scenarios
cs.LG 2026-05 conditional novelty 6.0

Text embeddings are robust to truncation without MRL except when reducing size by at least 80%.
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
cs.CL 2026-05 unverdicted novelty 6.0

GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...
MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal
cs.IR 2026-05 unverdicted novelty 6.0

MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide di...
Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction
cs.LG 2026-05 unverdicted novelty 6.0

A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.
A Survey of Reasoning-Intensive Retrieval: Progress and Challenges
cs.IR 2026-04 unverdicted novelty 6.0

A survey that categorizes RIR benchmarks by domain and modality, proposes a taxonomy for integrating reasoning into retrieval pipelines, and outlines key challenges.
Granite Embedding Multilingual R2 Models
cs.IR 2026-05 unverdicted novelty 4.0

Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 11 Pith papers · 6 internal anchors

[1]

Sentence-bert: Sentence embeddings using siamese bert-networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992,

work page 2019
[2]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025a. Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara...

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Jasper and stella: distillation of sota embedding models.arXiv preprint arXiv:2412.19048, 2024

Dun Zhang, Jiacheng Li, Ziyang Zeng, and Fulong Wang. Jasper and stella: distillation of sota embedding models.arXiv preprint arXiv:2412.19048, 2024a. Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Már- ton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemi´nski, Genta Indra Winata, et al. Mmteb: Massive multilingual text em- bed...

work page arXiv 1910
[4]

Tinybert: Distilling bert for natural language understanding

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. InFindings of the association for computational linguistics: EMNLP 2020, pages 4163–4174,

work page 2020
[5]

Improving efficient neural ranking models with cross-architecture knowledge distil- lation.arXiv preprint arXiv:2010.02666,

Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. Improving efficient neural ranking models with cross-architecture knowledge distillation.arXiv preprint arXiv:2010.02666,

work page arXiv 2010
[6]

Embeddistill: A geometric knowledge distillation for information retrieval

Seungyeon Kim, Ankit Singh Rawat, Manzil Zaheer, Sadeep Jayasumana, Veeranjaneyulu Sadhanala, Wittawat Jitkrittum, Aditya Krishna Menon, Rob Fergus, and Sanjiv Kumar. Embeddistill: A geometric knowledge distillation for information retrieval. arXiv preprint arXiv:2301.12005,

work page arXiv
[7]

xvlm2vec: Adapting lvlm-based embedding models to multilinguality using self-knowledge distillation.arXiv preprint arXiv:2503.09313,

Elio Musacchio, Lucia Siciliani, Pierpaolo Basile, and Giovanni Semeraro. xvlm2vec: Adapting lvlm-based embedding models to multilinguality using self-knowledge distillation.arXiv preprint arXiv:2503.09313,

work page arXiv
[8]

Learning task-agnostic representations through multi- teacher distillation

Philippe Formont, Maxime DARRIN, Banafsheh Karim- ian, Eric Granger, Jackie CK Cheung, Ismail Ben Ayed, Mohammadhadi Shateri, and Pablo Piantanida. Learning task-agnostic representations through multi- teacher distillation. InThe Thirty-ninth Annual Con- ference on Neural Information Processing Systems. Dun Zhang, Ziyang Zeng, Yudong Zhou, and Shuyang Lu....

work page arXiv 2014
[9]

M3-embedding: Multi- linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation

Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi- linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computa- tional Linguistics: ACL 2024, pages 2318–2335, Bangkok, ...

work page 2024
[10]

Association for Computational Linguistics. Isabelle Mohr, Markus Krimmel, Saba Sturua, Moham- mad Kalim Akram, Andreas Koukounas, Michael Günther, Georgios Mastrapas, Vinit Ravishankar, Joan Fontanals Martínez, Feng Wang, et al. Multi-task contrastive learning for 8192-token bilingual text embeddings.arXiv preprint arXiv:2402.17016,

work page arXiv
[11]

One embedder, any task: Instruction-finetuned text embeddings

Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction-finetuned text embeddings. InFindings of the Association for Computational Linguistics: ACL 2023, pages 1102–1121,

work page 2023
[12]

Eurobert: scaling multilingual encoders for european languages

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, et al. Eurobert: scaling multilingual encoders for european languages. arXiv preprint arXiv:2503.05500,

work page internal anchor Pith review arXiv
[13]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Matryoshka Representation Learning

Aditya Kusupati, Gantavya Bhatt, et al. Matryoshka Representation Learning. InAdvances in Neural In- formation Processing Systems (NeurIPS 2022),

work page 2022
[15]

mgte: Generalized long-context text representation and reranking models for multilingual text retrieval

Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, et al. mgte: Generalized long-context text representation and reranking models for multilingual text retrieval. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1393–1412, 2...

work page 2024
[16]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity. InSEM 2012: 1st Joint Conference on Lexical and Computational Semantics (SemEval),

work page 2012
[18]

Relational Knowledge Distillation

Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation.arXiv preprint arXiv:1904.05068,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[19]

Mteb: Massive text embedding benchmark

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. Mteb: Massive text embedding benchmark. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037,

work page 2014
[20]

Arctic-embed 2.0: Multilingual retrieval without compromise, 2024

Puxuan Yu, Luke Merrick, Gaurav Nuti, and Daniel Cam- pos. Arctic-embed 2.0: Multilingual retrieval without compromise.arXiv preprint arXiv:2412.04506,

work page arXiv
[21]

Multilingual E5 Text Embeddings: A Technical Report

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report.arXiv preprint arXiv:2402.05672,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Kalm-embedding-v2: Superior training techniques and data inspire a versatile embedding model.arXiv preprint arXiv:2506.20923, 2025

Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, et al. Kalm- embedding-v2: Superior training techniques and data inspire a versatile embedding model.arXiv preprint arXiv:2506.20923,

work page arXiv
[23]

Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, et al

State-of-the-art text embedding model with 32,000 token context length. Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, et al. jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval. In Proceedings of the 5th Workshop on Multili...

work page 2025
[24]

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych

Accessed: 2026-02-11. Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao...

work page 2026
[25]

A Appendix A.1 Hyperparameters The following table outlines all hyperparameters used during the various training phases

GitHub repository, Accessed: 2026-02-16. A Appendix A.1 Hyperparameters The following table outlines all hyperparameters used during the various training phases. For all the LoRA adapters we use a rank of 32 and an alpha value of

work page 2026
[26]

We ei- ther report results that are stated on the MTEB leaderboard11 or self-evaluate them using the mteb package12

384 / 40962·10 −5 250Kλ NCE =λ S = 1 Text-Matching j-v5-text-small20000 1×256 3845·10 −5 1Mτ= 0.02,τ ′ = 0.05, j-v5-text-nano20000 1×256 3845·10 −5 250Kλ NCE = 1,λD = 2 Clustering j-v5-text-small20,000 1×512 5121·10 −5 100K j-v5-text-nano20,000 1×1024 5121·10 −5 25K Classification j-v5-text-small30,000 4×64 5124·10 −4 3.5Mτ= 0.02, j-v5-text-nano30,000 4×1...

work page arXiv 2026

[1] [1]

Sentence-bert: Sentence embeddings using siamese bert-networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992,

work page 2019

[2] [2]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025a. Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara...

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Jasper and stella: distillation of sota embedding models.arXiv preprint arXiv:2412.19048, 2024

Dun Zhang, Jiacheng Li, Ziyang Zeng, and Fulong Wang. Jasper and stella: distillation of sota embedding models.arXiv preprint arXiv:2412.19048, 2024a. Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Már- ton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemi´nski, Genta Indra Winata, et al. Mmteb: Massive multilingual text em- bed...

work page arXiv 1910

[4] [4]

Tinybert: Distilling bert for natural language understanding

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. InFindings of the association for computational linguistics: EMNLP 2020, pages 4163–4174,

work page 2020

[5] [5]

Improving efficient neural ranking models with cross-architecture knowledge distil- lation.arXiv preprint arXiv:2010.02666,

Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. Improving efficient neural ranking models with cross-architecture knowledge distillation.arXiv preprint arXiv:2010.02666,

work page arXiv 2010

[6] [6]

Embeddistill: A geometric knowledge distillation for information retrieval

Seungyeon Kim, Ankit Singh Rawat, Manzil Zaheer, Sadeep Jayasumana, Veeranjaneyulu Sadhanala, Wittawat Jitkrittum, Aditya Krishna Menon, Rob Fergus, and Sanjiv Kumar. Embeddistill: A geometric knowledge distillation for information retrieval. arXiv preprint arXiv:2301.12005,

work page arXiv

[7] [7]

xvlm2vec: Adapting lvlm-based embedding models to multilinguality using self-knowledge distillation.arXiv preprint arXiv:2503.09313,

Elio Musacchio, Lucia Siciliani, Pierpaolo Basile, and Giovanni Semeraro. xvlm2vec: Adapting lvlm-based embedding models to multilinguality using self-knowledge distillation.arXiv preprint arXiv:2503.09313,

work page arXiv

[8] [8]

Learning task-agnostic representations through multi- teacher distillation

Philippe Formont, Maxime DARRIN, Banafsheh Karim- ian, Eric Granger, Jackie CK Cheung, Ismail Ben Ayed, Mohammadhadi Shateri, and Pablo Piantanida. Learning task-agnostic representations through multi- teacher distillation. InThe Thirty-ninth Annual Con- ference on Neural Information Processing Systems. Dun Zhang, Ziyang Zeng, Yudong Zhou, and Shuyang Lu....

work page arXiv 2014

[9] [9]

M3-embedding: Multi- linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation

Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi- linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computa- tional Linguistics: ACL 2024, pages 2318–2335, Bangkok, ...

work page 2024

[10] [10]

Association for Computational Linguistics. Isabelle Mohr, Markus Krimmel, Saba Sturua, Moham- mad Kalim Akram, Andreas Koukounas, Michael Günther, Georgios Mastrapas, Vinit Ravishankar, Joan Fontanals Martínez, Feng Wang, et al. Multi-task contrastive learning for 8192-token bilingual text embeddings.arXiv preprint arXiv:2402.17016,

work page arXiv

[11] [11]

One embedder, any task: Instruction-finetuned text embeddings

Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction-finetuned text embeddings. InFindings of the Association for Computational Linguistics: ACL 2023, pages 1102–1121,

work page 2023

[12] [12]

Eurobert: scaling multilingual encoders for european languages

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, et al. Eurobert: scaling multilingual encoders for european languages. arXiv preprint arXiv:2503.05500,

work page internal anchor Pith review arXiv

[13] [13]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Matryoshka Representation Learning

Aditya Kusupati, Gantavya Bhatt, et al. Matryoshka Representation Learning. InAdvances in Neural In- formation Processing Systems (NeurIPS 2022),

work page 2022

[15] [15]

mgte: Generalized long-context text representation and reranking models for multilingual text retrieval

Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, et al. mgte: Generalized long-context text representation and reranking models for multilingual text retrieval. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1393–1412, 2...

work page 2024

[16] [16]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity. InSEM 2012: 1st Joint Conference on Lexical and Computational Semantics (SemEval),

work page 2012

[18] [18]

Relational Knowledge Distillation

Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation.arXiv preprint arXiv:1904.05068,

work page internal anchor Pith review Pith/arXiv arXiv 1904

[19] [19]

Mteb: Massive text embedding benchmark

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. Mteb: Massive text embedding benchmark. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037,

work page 2014

[20] [20]

Arctic-embed 2.0: Multilingual retrieval without compromise, 2024

Puxuan Yu, Luke Merrick, Gaurav Nuti, and Daniel Cam- pos. Arctic-embed 2.0: Multilingual retrieval without compromise.arXiv preprint arXiv:2412.04506,

work page arXiv

[21] [21]

Multilingual E5 Text Embeddings: A Technical Report

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report.arXiv preprint arXiv:2402.05672,

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Kalm-embedding-v2: Superior training techniques and data inspire a versatile embedding model.arXiv preprint arXiv:2506.20923, 2025

Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, et al. Kalm- embedding-v2: Superior training techniques and data inspire a versatile embedding model.arXiv preprint arXiv:2506.20923,

work page arXiv

[23] [23]

Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, et al

State-of-the-art text embedding model with 32,000 token context length. Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, et al. jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval. In Proceedings of the 5th Workshop on Multili...

work page 2025

[24] [24]

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych

Accessed: 2026-02-11. Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao...

work page 2026

[25] [25]

A Appendix A.1 Hyperparameters The following table outlines all hyperparameters used during the various training phases

GitHub repository, Accessed: 2026-02-16. A Appendix A.1 Hyperparameters The following table outlines all hyperparameters used during the various training phases. For all the LoRA adapters we use a rank of 32 and an alpha value of

work page 2026

[26] [26]

We ei- ther report results that are stated on the MTEB leaderboard11 or self-evaluate them using the mteb package12

384 / 40962·10 −5 250Kλ NCE =λ S = 1 Text-Matching j-v5-text-small20000 1×256 3845·10 −5 1Mτ= 0.02,τ ′ = 0.05, j-v5-text-nano20000 1×256 3845·10 −5 250Kλ NCE = 1,λD = 2 Clustering j-v5-text-small20,000 1×512 5121·10 −5 100K j-v5-text-nano20,000 1×1024 5121·10 −5 25K Classification j-v5-text-small30,000 4×64 5124·10 −4 3.5Mτ= 0.02, j-v5-text-nano30,000 4×1...

work page arXiv 2026