Multilingual E5 Text Embeddings: A Technical Report
Pith reviewed 2026-05-12 19:12 UTC · model grok-4.3
The pith
Multilingual E5 embeddings match English state-of-the-art performance using the same training recipe
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The training methodology of the English E5 model transfers directly to multilingual data: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. The result is three embedding models of different sizes whose performance is competitive across languages, plus an instruction-tuned model that reaches parity with state-of-the-art English-only models of similar size.
What carries the argument
Contrastive pre-training on 1 billion multilingual text pairs, which builds general cross-lingual representations before supervised fine-tuning adapts them for downstream use.
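To make the contrastive step concrete, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives, the kind of objective commonly used for this sort of pre-training. This page does not state the exact loss, so the function name `info_nce_loss`, the temperature value, and the toy vectors are all illustrative assumptions, not the authors' implementation.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    queries[i] is assumed to be paired with passages[i]; every other
    passage in the batch serves as a negative for query i.
    The temperature value is an illustrative assumption.
    """
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every passage, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the correct (diagonal) pair.
        total += log_denominator - logits[i]
    return total / len(q)

# Toy 3-dimensional "embeddings" (made up for illustration).
passages = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]
shuffled = aligned[1:] + aligned[:1]  # every query now paired with the wrong passage

loss_aligned = info_nce_loss(aligned, passages)
loss_shuffled = info_nce_loss(shuffled, passages)
```

In this toy batch the loss is close to zero when each query sits next to its own passage and large when the pairing is shuffled, which is the signal that pulls paired texts together during pre-training.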
Load-bearing premise
The assumption that the training methodology from the English E5 model can be directly applied to multilingual data to achieve comparable performance without language-specific adjustments or biases.
What would settle it
A large-scale evaluation on diverse multilingual benchmarks where the new models show substantially lower performance than English-only counterparts on cross-lingual tasks, or where language-specific fine-tuning proves necessary for competitive results, would disprove the direct-transfer claim.
This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between the inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5 .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This technical report describes the training of open-source multilingual E5 text embedding models in three sizes (small, base, large). The procedure follows the English E5 recipe exactly: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on labeled datasets. It additionally introduces an instruction-tuned variant whose performance is claimed to be on par with English-only SOTA models of similar size. Models are released at the provided GitHub link.
Significance. If the performance claims are substantiated, the work delivers practical open-source multilingual embeddings that could reduce reliance on English-centric models for cross-lingual tasks while maintaining competitive quality. The public release of models and the direct reuse of a proven training recipe support reproducibility and adoption in the community.
major comments (2)
- [Training Methodology and Evaluation sections] The central performance-parity claim for the instruction-tuned model rests on direct transfer of the English E5 recipe to 1B multilingual pairs, yet no ablation isolates the multilingual pre-training effect (e.g., versus an English-only baseline trained on equivalent volume). This omission leaves open the possibility of negative transfer or data-imbalance effects and is load-bearing for the 'on par with SOTA' assertion.
- [Evaluation results] Results tables present aggregate scores but lack per-language breakdowns or language-balance statistics for the 1B-pair corpus. Without these, it is impossible to verify that low-resource languages do not degrade overall performance or that the parity claim holds uniformly.
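The second major comment can be illustrated with a small sketch of how an aggregate score can hide a weak language. The scores below are invented purely for illustration and are not results from the report; the helper names (`per_language_scores`, `micro_average`, `macro_average`) are likewise hypothetical.

```python
from collections import defaultdict

def per_language_scores(results):
    """Group (language, score) pairs and return the mean score per language."""
    buckets = defaultdict(list)
    for language, score in results:
        buckets[language].append(score)
    return {lang: sum(scores) / len(scores) for lang, scores in buckets.items()}

def micro_average(results):
    """Average over all examples; dominated by high-resource languages."""
    scores = [score for _, score in results]
    return sum(scores) / len(scores)

def macro_average(results):
    """Average of per-language means; each language counts equally."""
    per_lang = per_language_scores(results)
    return sum(per_lang.values()) / len(per_lang)

# Made-up evaluation results: 8 English examples, 2 Swahili examples.
results = [("en", 0.8)] * 8 + [("sw", 0.4)] * 2

breakdown = per_language_scores(results)
micro = micro_average(results)
macro = macro_average(results)
```

With these invented numbers the example-weighted average sits near the English score, while the per-language breakdown shows the lower-resource language lagging well behind. That gap is what the requested per-language tables and corpus language statistics would expose.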
minor comments (2)
- [Model description] Model sizes (parameter counts) for 'small / base / large' are stated but not tabulated with exact figures; add a table row for clarity.
- [Abstract] The abstract's performance claim would benefit from naming the specific English SOTA baselines and metrics (e.g., MTEB scores) rather than the generic phrase 'on par with state-of-the-art'.
Simulated Author's Rebuttal
We appreciate the referee's thorough review of our technical report on Multilingual E5 Text Embeddings. We address each major comment in detail below and outline the changes we plan to make in the revised manuscript.
-
Referee: The central performance-parity claim for the instruction-tuned model rests on direct transfer of the English E5 recipe to 1B multilingual pairs, yet no ablation isolates the multilingual pre-training effect (e.g., versus an English-only baseline trained on equivalent volume). This omission leaves open the possibility of negative transfer or data-imbalance effects and is load-bearing for the 'on par with SOTA' assertion.
Authors: We acknowledge that an ablation study comparing the multilingual pre-training to an English-only baseline on the same data volume would provide valuable insights into potential negative transfer or data imbalance effects. However, training an additional model on 1 billion pairs requires significant computational resources that were not available for this technical report. The manuscript demonstrates that applying the English E5 recipe to multilingual data yields models whose performance is on par with English SOTA models on relevant benchmarks. This outcome indicates that any negative transfer effects are not substantial enough to prevent achieving competitive results. In the revised manuscript, we will expand the discussion section to explicitly address this point and suggest it as an avenue for future research. revision: partial
-
Referee: Results tables present aggregate scores but lack per-language breakdowns or language-balance statistics for the 1B-pair corpus. Without these, it is impossible to verify that low-resource languages do not degrade overall performance or that the parity claim holds uniformly.
Authors: We agree with the referee that per-language breakdowns and language-balance statistics would improve the transparency of our results. The 1 billion text pairs corpus was curated to include a diverse set of languages, with efforts to balance representation where data availability permitted. In the revised version of the manuscript, we will add language distribution statistics for the pre-training corpus and include per-language performance metrics for the evaluation datasets where such breakdowns are feasible and meaningful. revision: yes
Circularity Check
No circularity: empirical training report with no derivations or self-referential reductions.
The manuscript is a technical report describing contrastive pre-training on 1B multilingual pairs followed by fine-tuning, explicitly adhering to the prior English E5 recipe and reporting downstream evaluation scores. No equations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear. The performance parity claim is an empirical observation, not a derivation that reduces to its inputs by construction. Self-citation to the English E5 work is present but not load-bearing for any mathematical result; the central content remains independent training and evaluation.
Lean theorems connected to this paper
-
Cost.FunctionalEquation washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets.
-
Foundation.HierarchyEmergence hierarchy_emergence_forces_phi
Relation between the paper passage and the cited Recognition theorem: unclear
Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 52 Pith papers
-
IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
-
Temporal Decay of Co-Citation Predictability: A 20-Year Statute Retrieval Benchmark from 396M Ukrainian Court Citations
Co-citation predictability for statute retrieval decays over 20 years in Ukrainian court data, dropping 33-47% in MRR with non-uniform patterns across legal domains.
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
-
Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
-
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
-
RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora
RARE builds redundancy-aware benchmarks via atomic fact decomposition and CRRF-enhanced LLM generation, showing retriever PerfRecall@10 dropping from 66.4% on general data to 5.0-27.9% on high-similarity finance/legal...
-
Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers
Code-switching creates a fundamental performance bottleneck for multilingual retrievers, causing drops of up to 27% on new benchmarks CSR-L and CS-MTEB, with embedding divergence as the key cause and vocabulary expans...
-
Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering
Claim2Vec is a contrastively fine-tuned multilingual encoder that improves claim clustering performance and embedding space structure on multilingual fact-check datasets.
-
LMEB: Long-horizon Memory Embedding Benchmark
LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
-
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme ac...
-
One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.
-
Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance
Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.
-
Can Large Audio Language Models Ignore Multilingual Distractors? An Evaluation of Their Selective Auditory Attention Capabilities
Introduces the MUSA benchmark and evaluates LALMs showing that strong single-speaker performance fails to ensure robust selective attention under multilingual interference, with errors from source confusion and unreso...
-
To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Embeddings, Except In Heavy Truncation Scenarios
Text embeddings are robust to truncation without MRL except when reducing size by at least 80%.
-
An Annotation Scheme and Classifier for Personal Facts in Dialogue
An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 ...
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...
-
MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal
MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide di...
-
Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding
Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.
-
Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization
Iterative LLM-based refinement of category definitions improves zero-shot classification performance across 13 embedding models on a new 10-category web URL benchmark.
-
JFinTEB: Japanese Financial Text Embedding Benchmark
JFinTEB is the first benchmark for evaluating Japanese financial text embeddings across retrieval and classification tasks derived from realistic financial scenarios.
-
HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval
HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...
-
VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...
-
Learning to Retrieve from Agent Trajectories
Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.
-
Adaptive Prompt Elicitation for Text-to-Image Generation
Adaptive Prompt Elicitation (APE) uses an information-theoretic framework to generate visual queries that elicit and compile user intent into better prompts for text-to-image models, showing improved alignment in benc...
-
Reliable Evaluation Protocol for Low-Precision Retrieval
Proposes High-Precision Scoring (HPS) and Tie-aware Retrieval Metrics (TRM) to reduce tie-induced instability in low-precision retrieval evaluation.
-
Causal2Vec: Improving Decoder-only LLMs as Embedding Models through a Contextual Token
Causal2Vec prepends a BERT-generated contextual token to decoder-only LLMs and pools its hidden state with the EOS token to reach new SOTA on MTEB among public-data-trained embedding models.
-
Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning
QCEA reformulates entity alignment as a query-conditioned ranking task with semantic encoding, graph learning, and direction-aware transformation to handle context-dependent, asymmetric correspondences in medical know...
-
QOuLiPo: What a quantum computer sees when it reads a book
Literary texts are turned into graphs for neutral-atom quantum processors, with a new rigidity metric distinguishing structural uniqueness and a QOuLiPo corpus of engineered texts created to match hardware-native graphs.
-
GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass inference with modular flexibility.
-
CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation
CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.
-
AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition
AffectAgent deploys a query planner, evidence filter, and emotion generator as collaborative agents trained via MAPPO with shared reward, plus MB-MoE and RAAF modules, to achieve superior multimodal emotion recognitio...
-
Human-Inspired Context-Selective Multimodal Memory for Social Robots
A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.
-
Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG
Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.
-
On the Representational Limits of Quantum-Inspired 1024-D Document Embeddings: An Experimental Evaluation Framework
Quantum-inspired 1024-D document embeddings exhibit weak, unstable ranking performance and structural geometric limitations, performing better as auxiliary components in hybrid lexical-embedding retrieval systems.
-
Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition
ADAM uses personality-guided LLM augmentation and cross-lingual attention distillation to raise balanced accuracy on multilingual personality recognition to 0.6332 on Essays and 0.7448 on Kaggle, outperforming standar...
-
From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning
Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.
-
jina-embeddings-v5-text: Task-Targeted Embedding Distillation
A distillation-plus-task-contrastive training regimen yields compact embedding models that match or exceed state-of-the-art performance for their size while supporting 32k-token contexts and quantization.
-
Triplet Feature Fusion for Equipment Anomaly Prediction : An Open-Source Methodology Using Small Foundation Models
Triplet fusion of 28 statistical features, 64-dim time-series embeddings from a 133K-param model, and 1024-dim text embeddings into LightGBM yields 0.992 precision and 0.998 AUC on 67k HVAC samples while cutting false...
-
Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms
ELERAG integrates Wikidata entity linking with hybrid RRF re-ranking into RAG and outperforms baselines on a custom Italian academic dataset while cross-encoder methods win on the general SQuAD-it dataset.
-
Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
A 300M multilingual embedding model matches or exceeds 7B retrieval performance via optimized data scale, hard negatives, and task diversity over language diversity.
-
Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging
Language composition in training data creates opposing effects on CLIR and mono-IR performance for Korean-English retrieval, which model merging can partially resolve.
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
-
Automated ICD Classification of Psychiatric Diagnoses: From Classical NLP to Large Language Models
Fine-tuned e5_large LLM reaches 0.866 F1_micro on ICD classification of 145k Spanish psychiatric texts, outperforming BoW, TF-IDF, and other transformers.
-
Granite Embedding Multilingual R2 Models
Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.
-
Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation
A distillation technique embeds LLM-generated textual user profiles into efficient sequential recommenders without runtime LLM inference, architectural changes, or fine-tuning.
-
Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance ...
-
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.
-
KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model
Framework using XLM-RoBERTa, back-translation augmentation, and language-specific thresholds detects reclaimed slurs with 2-5% F1 score gains.
-
KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model
A system using XLM-RoBERTa, GPT-4 back-translation augmentation, undersampling, and language-specific threshold tuning reports 2-5% F1 gains on multilingual slur reclamation detection.
-
HR-Agents: Using Multiple LLM-based Agents to Improve Q&A about Brazilian Labor Legislation
A multi-agent LLM system using CrewAI and RAG improves response coherence and correctness over a single-LLM RAG baseline for Brazilian labor law Q&A.