jina-embeddings-v5-text: Task-Targeted Embedding Distillation
Pith reviewed 2026-05-15 21:45 UTC · model grok-4.3
The pith
Combining distillation with task-specific contrastive loss produces compact text embedding models that match or exceed state-of-the-art benchmarks for their size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce a training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Their findings indicate that this combined approach trains small models more effectively than purely contrastive or purely distillation-based methods. The resulting jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano models match or surpass state-of-the-art scores for models of comparable size, support texts of up to 32k tokens in many languages, and remain robust under truncation and binary quantization.
What carries the argument
The argument rests on the task-targeted embedding distillation regimen, which pairs knowledge distillation from a teacher model with a contrastive loss customized to downstream tasks such as retrieval and classification.
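As a rough illustration of how such a pairing could look, the sketch below adds a cosine-distance distillation term to an InfoNCE term with in-batch negatives. It is not the authors' implementation: the function names, the weights `lam_distill` and `lam_nce`, the default temperature, and the assumption that student and teacher embeddings share the same dimension are all hypothetical choices made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def distill_loss(student, teacher):
    """Mean cosine distance between paired student and teacher embeddings.

    Assumes both models emit vectors of the same dimension (hypothetical).
    """
    distances = [1.0 - cosine(s, t) for s, t in zip(student, teacher)]
    return sum(distances) / len(distances)

def info_nce(queries, docs, tau=0.05):
    """InfoNCE with in-batch negatives: docs[i] is the positive for queries[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / tau for d in docs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)

def combined_loss(student_q, student_d, teacher_q, teacher_d,
                  lam_distill=1.0, lam_nce=1.0, tau=0.05):
    """Weighted sum of the two terms; the weights are illustrative only."""
    distill = distill_loss(student_q + student_d, teacher_q + teacher_d)
    contrast = info_nce(student_q, student_d, tau)
    return lam_distill * distill + lam_nce * contrast
```

Setting `lam_nce=0.0` or `lam_distill=0.0` recovers a purely distillation-based or purely contrastive objective, which is the comparison the core claim depends on.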
If this is right
- Small embedding models can reach or exceed the performance of larger ones on semantic similarity, retrieval, clustering, and classification benchmarks.
- The models maintain effectiveness on long inputs up to 32k tokens in multiple languages.
- Embeddings stay reliable after input truncation or conversion to binary quantized form.
- Public release of the model weights allows direct use and further experimentation by others.
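The truncation and binary quantization referred to in the list above can be pictured with a minimal sketch. The page does not describe the paper's exact scheme, so prefix truncation with renormalization and sign-based binarization are assumed here as common choices, and all function names are hypothetical.

```python
import math

def truncate(vec, k):
    """Keep the first k dimensions and rescale the result to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

def binarize(vec):
    """Map each dimension to one bit: 1 if the value is positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]

def hamming_similarity(bits_a, bits_b):
    """Fraction of bit positions on which two binary codes agree."""
    agree = sum(1 for a, b in zip(bits_a, bits_b) if a == b)
    return agree / len(bits_a)
```

An embedding is "robust" in this sense when nearest-neighbour rankings computed from `truncate(...)` or from `hamming_similarity(binarize(a), binarize(b))` stay close to the rankings obtained from the full-precision vectors.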
Where Pith is reading between the lines
- The same hybrid regimen might transfer to training compact embedding models for other modalities, such as images or audio.
- Further reductions in model size could remain viable if the distillation and contrastive components are tuned together.
- Integration into retrieval systems could lower memory and latency costs without major accuracy loss.
Load-bearing premise
The performance gains arise specifically from combining distillation with task-specific contrastive loss rather than from unstated differences in training data selection or hyperparameter tuning.
What would settle it
An ablation experiment training identical small models with only distillation, only task-specific contrastive loss, and the full combination, then measuring whether the hybrid version alone reaches the reported benchmark levels.
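One way to lay out such an ablation is sketched below: three arms that differ only in which loss terms are switched on, plus a check for whether the hybrid arm alone reaches a target score. The arm names, the unit weights, and the `hybrid_is_needed` helper are hypothetical; the sketch contains no benchmark numbers and assumes data, architecture, and seeds are held fixed outside it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arm:
    name: str
    lam_distill: float  # weight on the distillation term
    lam_nce: float      # weight on the contrastive term

def ablation_arms():
    """Three arms that differ only in which loss terms are active."""
    return [
        Arm("distill_only", 1.0, 0.0),
        Arm("contrastive_only", 0.0, 1.0),
        Arm("hybrid", 1.0, 1.0),
    ]

def hybrid_is_needed(scores, target):
    """True when the hybrid arm is the only one whose score reaches the target.

    `scores` maps arm name to a benchmark score measured by the caller.
    """
    reached = {name for name, score in scores.items() if score >= target}
    return reached == {"hybrid"}
```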
Original abstract
Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or distillation-based training paradigms alone. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size. jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust under truncation and binary quantization. Model weights are publicly available, hopefully inspiring further advances in embedding model development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces jina-embeddings-v5-text, a family of compact text embedding models trained via a novel regimen that combines model distillation techniques with task-specific contrastive loss. It claims this hybrid approach is more effective for small models than purely contrastive or distillation-based training alone, with the resulting jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano variants matching or exceeding state-of-the-art performance on benchmarks for their size. The models support contexts up to 32k tokens across many languages and produce embeddings robust to truncation and binary quantization; weights are released publicly.
Significance. If the performance claims are substantiated with proper controls, the work could meaningfully advance efficient embedding model development by demonstrating a practical hybrid training recipe that improves small-model regimes, with direct implications for deployment in retrieval, clustering, and classification tasks under resource constraints. The public release of weights is a clear strength enabling reproducibility and follow-on research.
major comments (2)
- [Abstract and Experimental Results] The central claim that the combined distillation + task-specific contrastive regimen outperforms purely contrastive or purely distillation-based paradigms (abstract) lacks supporting evidence from controlled ablations. No results are shown for the identical small/nano architectures trained on the same data using (a) contrastive loss alone or (b) distillation alone, so attribution of gains to the combination rather than data curation or hyperparameter choices cannot be verified.
- [Benchmark Results] Benchmark superiority or parity claims for jina-embeddings-v5-text-small and nano (abstract) are stated without accompanying numerical tables, exact MTEB scores, or direct head-to-head comparisons against named baselines of similar size; this prevents independent verification of the 'exceed or match' assertion.
minor comments (2)
- [Abstract] The abstract states support for 'many languages' but does not enumerate the languages or report per-language or cross-lingual metrics.
- [Methods] Notation for the task-specific contrastive loss should be formalized with an equation in the methods section to clarify its distinction from standard contrastive objectives.
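For orientation, the standard InfoNCE objective that a formalized task-specific loss would be contrasted with can be written as follows. This is the textbook form with in-batch negatives, not necessarily the paper's variant; the similarity function s, the temperature τ, and the batch size B are notational assumptions.

```latex
\mathcal{L}_{\mathrm{NCE}}^{q \to d}
  = -\frac{1}{B} \sum_{i=1}^{B}
    \log \frac{\exp\!\big(s(q_i, d_i)/\tau\big)}
              {\sum_{j=1}^{B} \exp\!\big(s(q_i, d_j)/\tau\big)}
```

Here d_i is the positive document for query q_i and the other documents in the batch serve as negatives.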
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the current manuscript requires additional evidence to support the central claims and will revise accordingly to include controlled ablations and explicit benchmark numbers.
Point-by-point responses
- Referee: [Abstract and Experimental Results] The central claim that the combined distillation + task-specific contrastive regimen outperforms purely contrastive or purely distillation-based paradigms (abstract) lacks supporting evidence from controlled ablations. No results are shown for the identical small/nano architectures trained on the same data using (a) contrastive loss alone or (b) distillation alone, so attribution of gains to the combination rather than data curation or hyperparameter choices cannot be verified.
  Authors: We agree that the absence of controlled ablations prevents clear attribution of gains to the hybrid regimen. In the revised manuscript we will add results for the identical small and nano architectures trained on the same data using (a) contrastive loss alone and (b) distillation alone, with all other factors held constant. These new experiments will be presented in a dedicated ablation subsection.
  Revision: yes
- Referee: [Benchmark Results] Benchmark superiority or parity claims for jina-embeddings-v5-text-small and nano (abstract) are stated without accompanying numerical tables, exact MTEB scores, or direct head-to-head comparisons against named baselines of similar size; this prevents independent verification of the 'exceed or match' assertion.
  Authors: We acknowledge that the abstract currently lacks specific numerical values. The full experimental section already contains detailed MTEB tables with exact scores and comparisons to named baselines of comparable size (e.g., 22M–50M parameter models). In the revision we will insert a compact summary table of key MTEB scores and direct comparisons into the abstract to enable immediate verification.
  Revision: yes
Circularity Check
No derivation chain or circularity present in empirical claims
The paper describes an empirical training regimen that combines distillation with task-specific contrastive loss for small embedding models, then reports benchmark scores against external SOTA. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the abstract or described full text. Claims rest on direct model training results and external benchmark comparisons (MTEB-style), which are falsifiable outside the paper and do not reduce to self-definition or input renaming. This is a standard empirical ML contribution with no circular steps in any derivation chain.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss... L_distill = sum of cosine distances... L_NCE^{q→d} InfoNCE loss... L_GOR global orthogonal regularizer"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 13 Pith papers
- Test-Time Compute for Frozen Embedding Models through Agentic Program Search
  A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.
- Test-Time Compute for Frozen Embedding Models through Agentic Program Search
  Agentic program search over frozen embedding APIs yields a parameter-free inference algebra (a softmax-weighted centroid of top-K documents interpolated with the query) that lifts nDCG@10 across seven model families on ...
- jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
  Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
- LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
  LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
- SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
  SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
- LMEB: Long-horizon Memory Embedding Benchmark
  LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
- One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
  Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.
- To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Learning, Except In Heavy Truncation Scenarios
  Text embeddings are robust to truncation without MRL except when reducing size by at least 80%.
- jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
  GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...
- MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal
  MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide di...
- Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction
  A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.
- A Survey of Reasoning-Intensive Retrieval: Progress and Challenges
  A survey that categorizes RIR benchmarks by domain and modality, proposes a taxonomy for integrating reasoning into retrieval pipelines, and outlines key challenges.
- Granite Embedding Multilingual R2 Models
  Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.
[25]
GitHub repository, Accessed: 2026-02-16. A Appendix A.1 Hyperparameters The following table outlines all hyperparameters used during the various training phases. For all the LoRA adapters we use a rank of 32 and an alpha value of
work page 2026
-
[26]
384 / 40962·10 −5 250Kλ NCE =λ S = 1 Text-Matching j-v5-text-small20000 1×256 3845·10 −5 1Mτ= 0.02,τ ′ = 0.05, j-v5-text-nano20000 1×256 3845·10 −5 250Kλ NCE = 1,λD = 2 Clustering j-v5-text-small20,000 1×512 5121·10 −5 100K j-v5-text-nano20,000 1×1024 5121·10 −5 25K Classification j-v5-text-small30,000 4×64 5124·10 −4 3.5Mτ= 0.02, j-v5-text-nano30,000 4×1...