pith. machine review for the scientific record.

arxiv: 2402.05672 · v1 · submitted 2024-02-08 · 💻 cs.CL · cs.IR

Recognition: 2 Lean theorem links

Multilingual E5 Text Embeddings: A Technical Report

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 19:12 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords multilingual embeddings · text embeddings · contrastive learning · instruction tuning · E5 models · cross-lingual retrieval · semantic search · natural language processing

The pith

Multilingual E5 embeddings match English state-of-the-art performance using the same training recipe

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper details how to create text embedding models that function well across many languages by following the established English E5 training steps without major alterations. It begins with contrastive pre-training on one billion multilingual text pairs to build broad representations, then applies fine-tuning on labeled datasets for task adaptation, and includes a new instruction-tuned variant. This matters for practical use because reliable multilingual embeddings improve semantic search, retrieval, and understanding in non-English contexts where high-quality options have been scarce. The models are released in small, base, and large sizes to let users trade off speed against accuracy. The standout result is that the instruction-tuned multilingual version performs at the level of leading English-only models of similar size.

Core claim

The training methodology from the English E5 model transfers directly to multilingual data: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets, yields three embedding models of different sizes that are competitive across languages, plus an instruction-tuned model that reaches parity with state-of-the-art English-only models of similar size.

What carries the argument

Contrastive pre-training on 1 billion multilingual text pairs, which builds general cross-lingual representations before supervised fine-tuning adapts them for downstream use.
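The objective behind this pre-training stage is the standard InfoNCE loss with in-batch negatives. A minimal numpy sketch of the loss arithmetic; the temperature and the toy batch are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: each query's positive is the passage
    at the same index; every other passage in the batch serves as a negative."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    logits = q @ p.T / temperature                 # (batch, batch) cosine sims
    logits -= logits.max(axis=1, keepdims=True)    # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # cross-entropy on the diagonal

# toy batch of four orthogonal pairs: aligned positives give a near-zero loss,
# shuffled positives give a large one
batch = np.eye(4, 8)
aligned = info_nce_loss(batch, batch)
misaligned = info_nce_loss(batch, np.roll(batch, 1, axis=0))
print(aligned, misaligned)
```

Minimizing this loss pulls each pair together and pushes the rest of the batch apart, which is how the 1B pairs shape the cross-lingual embedding space before any supervised labels are seen.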

Load-bearing premise

The assumption that the training methodology from the English E5 model can be directly applied to multilingual data to achieve comparable performance without language-specific adjustments or biases.

What would settle it

A large-scale evaluation on diverse multilingual benchmarks where the new models show substantially lower performance than English-only counterparts on cross-lingual tasks, or where language-specific fine-tuning proves necessary for competitive results, would disprove the direct-transfer claim.

read the original abstract

This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between the inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5 .
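At inference, E5-family models are typically used by embedding prefixed inputs ("query: …" / "passage: …"), mean-pooling the encoder's token states, and ranking passages by cosine similarity. A hedged sketch of just the pooling-and-ranking arithmetic, with stand-in token states rather than the released encoder:

```python
import numpy as np

def mean_pool(token_states, attention_mask):
    """Average token states over real (non-padding) positions."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_states * mask).sum(axis=1)
    counts = mask.sum(axis=1).clip(min=1.0)
    return summed / counts

def rank_passages(query_vec, passage_vecs):
    """Cosine-similarity ranking after L2 normalization."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q
    return np.argsort(-scores), scores

# stand-in token states for one "query: ..." input (2 real tokens + 1 pad)
states = np.array([[[1.0, 0, 0, 0], [1.0, 0, 0, 0], [5.0, 5, 5, 5]]])
mask = np.array([[1, 1, 0]])                     # padding position is masked out
query = mean_pool(states, mask)[0]               # -> [1, 0, 0, 0]
passages = np.array([[0.9, 0.1, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
order, scores = rank_passages(query, passages)
print(order)
```

The attention mask matters: without it, the padding row would drag the pooled vector off-axis and corrupt the ranking.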

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This technical report describes the training of open-source multilingual E5 text embedding models in three sizes (small, base, large). The procedure follows the English E5 recipe exactly: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on labeled datasets. It additionally introduces an instruction-tuned variant whose performance is claimed to be on par with English-only SOTA models of similar size. Models are released at the provided GitHub link.

Significance. If the performance claims are substantiated, the work delivers practical open-source multilingual embeddings that could reduce reliance on English-centric models for cross-lingual tasks while maintaining competitive quality. The public release of models and the direct reuse of a proven training recipe support reproducibility and adoption in the community.

major comments (2)
  1. [Training Methodology and Evaluation sections] The central performance-parity claim for the instruction-tuned model rests on direct transfer of the English E5 recipe to 1B multilingual pairs, yet no ablation isolates the multilingual pre-training effect (e.g., versus an English-only baseline trained on equivalent volume). This omission leaves open the possibility of negative transfer or data-imbalance effects and is load-bearing for the 'on par with SOTA' assertion.
  2. [Evaluation results] Results tables present aggregate scores but lack per-language breakdowns or language-balance statistics for the 1B-pair corpus. Without these, it is impossible to verify that low-resource languages do not degrade overall performance or that the parity claim holds uniformly.
minor comments (2)
  1. [Model description] Model sizes (parameter counts) for 'small / base / large' are stated but not tabulated with exact figures; add a table row for clarity.
  2. [Abstract] The abstract's performance claim would benefit from naming the specific English SOTA baselines and metrics (e.g., MTEB scores) rather than the generic phrase 'on par with state-of-the-art'.
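The referee's aggregation concern can be made concrete with a toy calculation: a score weighted by per-language example counts (micro average) can mask low-resource degradation that a uniform macro average exposes. All numbers below are hypothetical:

```python
# Hypothetical per-language retrieval scores and example counts
scores = {"en": 0.80, "de": 0.78, "sw": 0.40}   # low-resource stand-in: Swahili
counts = {"en": 9000, "de": 900, "sw": 100}

micro = sum(scores[l] * counts[l] for l in scores) / sum(counts.values())
macro = sum(scores.values()) / len(scores)

print(f"micro (count-weighted): {micro:.3f}")   # dominated by English
print(f"macro (uniform): {macro:.3f}")          # exposes the weak language
```

A 0.13-point gap between the two averages on the same results is exactly the kind of discrepancy a per-language breakdown would surface.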

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough review of our technical report on Multilingual E5 Text Embeddings. We address each major comment in detail below and outline the changes we plan to make in the revised manuscript.

read point-by-point responses
  1. Referee: The central performance-parity claim for the instruction-tuned model rests on direct transfer of the English E5 recipe to 1B multilingual pairs, yet no ablation isolates the multilingual pre-training effect (e.g., versus an English-only baseline trained on equivalent volume). This omission leaves open the possibility of negative transfer or data-imbalance effects and is load-bearing for the 'on par with SOTA' assertion.

    Authors: We acknowledge that an ablation study comparing the multilingual pre-training to an English-only baseline on the same data volume would provide valuable insights into potential negative transfer or data imbalance effects. However, training an additional model on 1 billion pairs requires significant computational resources that were not available for this technical report. The manuscript demonstrates that applying the English E5 recipe to multilingual data yields models whose performance is on par with English SOTA models on relevant benchmarks. This outcome indicates that any negative transfer effects are not substantial enough to prevent achieving competitive results. In the revised manuscript, we will expand the discussion section to explicitly address this point and suggest it as an avenue for future research. revision: partial

  2. Referee: Results tables present aggregate scores but lack per-language breakdowns or language-balance statistics for the 1B-pair corpus. Without these, it is impossible to verify that low-resource languages do not degrade overall performance or that the parity claim holds uniformly.

    Authors: We agree with the referee that per-language breakdowns and language-balance statistics would improve the transparency of our results. The 1 billion text pairs corpus was curated to include a diverse set of languages, with efforts to balance representation where data availability permitted. In the revised version of the manuscript, we will add language distribution statistics for the pre-training corpus and include per-language performance metrics for the evaluation datasets where such breakdowns are feasible and meaningful. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training report with no derivations or self-referential reductions.

full rationale

The manuscript is a technical report describing contrastive pre-training on 1B multilingual pairs followed by fine-tuning, explicitly adhering to the prior English E5 recipe and reporting downstream evaluation scores. No equations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear. The performance parity claim is an empirical observation, not a derivation that reduces to its inputs by construction. Self-citation to the English E5 work is present but not load-bearing for any mathematical result; the central content remains independent training and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an empirical technical report on model training, there are no explicit free parameters, mathematical axioms, or newly invented entities described in the abstract. The work relies on standard machine learning practices and datasets.

pith-pipeline@v0.9.0 · 5422 in / 990 out tokens · 57506 ms · 2026-05-12T19:12:33.165163+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 7.0

    Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.

  2. How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...

  3. Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders

    cs.CL 2026-05 unverdicted novelty 7.0

    EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.

  4. ATIR: Towards Audio-Text Interleaved Contextual Retrieval

    cs.SD 2026-04 unverdicted novelty 7.0

    Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

  5. RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora

    cs.CL 2026-04 unverdicted novelty 7.0

    RARE builds redundancy-aware benchmarks via atomic fact decomposition and CRRF-enhanced LLM generation, showing retriever PerfRecall@10 dropping from 66.4% on general data to 5.0-27.9% on high-similarity finance/legal...

  6. Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

    cs.IR 2026-04 unverdicted novelty 7.0

    Code-switching creates a fundamental performance bottleneck for multilingual retrievers, causing drops of up to 27% on new benchmarks CSR-L and CS-MTEB, with embedding divergence as the key cause and vocabulary expans...

  7. Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering

    cs.CL 2026-04 unverdicted novelty 7.0

    Claim2Vec is a contrastively fine-tuned multilingual encoder that improves claim clustering performance and embedding space structure on multilingual fact-check datasets.

  8. LMEB: Long-horizon Memory Embedding Benchmark

    cs.CL 2026-03 unverdicted novelty 7.0

    LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.

  9. An Annotation Scheme and Classifier for Personal Facts in Dialogue

    cs.CL 2026-05 accept novelty 6.0

    An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 ...

  10. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 6.0

    GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...

  11. MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal

    cs.IR 2026-05 unverdicted novelty 6.0

    MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide di...

  12. Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.

  13. Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    Iterative LLM-based refinement of category definitions improves zero-shot classification performance across 13 embedding models on a new 10-category web URL benchmark.

  14. JFinTEB: Japanese Financial Text Embedding Benchmark

    cs.IR 2026-04 unverdicted novelty 6.0

    JFinTEB is the first benchmark for evaluating Japanese financial text embeddings across retrieval and classification tasks derived from realistic financial scenarios.

  15. HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...

  16. VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation

    cs.CV 2026-04 conditional novelty 6.0

    VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...

  17. Learning to Retrieve from Agent Trajectories

    cs.IR 2026-03 conditional novelty 6.0

    Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.

  18. QOuLiPo: What a quantum computer sees when it reads a book

    quant-ph 2026-05 unverdicted novelty 5.0

    Literary texts are turned into graphs for neutral-atom quantum processors, with a new rigidity metric distinguishing structural uniqueness and a QOuLiPo corpus of engineered texts created to match hardware-native graphs.

  19. GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

    cs.CL 2026-05 unverdicted novelty 5.0

    GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass inference with modular flexibility.


  21. CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 5.0

    CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.

  22. AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

    cs.CV 2026-04 unverdicted novelty 5.0

    AffectAgent deploys a query planner, evidence filter, and emotion generator as collaborative agents trained via MAPPO with shared reward, plus MB-MoE and RAAF modules, to achieve superior multimodal emotion recognitio...

  23. Human-Inspired Context-Selective Multimodal Memory for Social Robots

    cs.AI 2026-04 unverdicted novelty 5.0

    A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.

  24. Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

    cs.CL 2026-04 unverdicted novelty 5.0

    Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.

  25. On the Representational Limits of Quantum-Inspired 1024-D Document Embeddings: An Experimental Evaluation Framework

    cs.IR 2026-04 unverdicted novelty 5.0

    Quantum-inspired 1024-D document embeddings exhibit weak, unstable ranking performance and structural geometric limitations, performing better as auxiliary components in hybrid lexical-embedding retrieval systems.

  26. Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition

    cs.CL 2026-04 unverdicted novelty 5.0

    ADAM uses personality-guided LLM augmentation and cross-lingual attention distillation to raise balanced accuracy on multilingual personality recognition to 0.6332 on Essays and 0.7448 on Kaggle, outperforming standar...

  27. From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning

    q-bio.QM 2026-04 unverdicted novelty 5.0

    Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.

  28. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  29. Granite Embedding Multilingual R2 Models

    cs.IR 2026-05 unverdicted novelty 4.0

    Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.

  30. Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation

    cs.IR 2026-04 unverdicted novelty 4.0

    A distillation technique embeds LLM-generated textual user profiles into efficient sequential recommenders without runtime LLM inference, architectural changes, or fine-tuning.

  31. Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task

    cs.CL 2026-04 unverdicted novelty 4.0

    Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance ...

  32. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    cs.CL 2026-01 unverdicted novelty 4.0

    Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.

  33. Continual Learning with Multilingual Foundation Model

    cs.CL 2026-05 unverdicted novelty 3.0

    Framework using XLM-RoBERTa, back-translation augmentation, and language-specific thresholds detects reclaimed slurs with 2-5% F1 score gains.

  34. HR-Agents: Using Multiple LLM-based Agents to Improve Q&A about Brazilian Labor Legislation

    cs.IR 2026-03 unverdicted novelty 3.0

    A multi-agent LLM system using CrewAI and RAG improves response coherence and correctness over a single-LLM RAG baseline for Brazilian labor law Q&A.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 32 Pith papers · 7 internal anchors

  1. [1]

    Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597--610

  2. [2]

    Daniel Fernando Campos, Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, Li Deng, and Bhaskar Mitra. 2016. https://arxiv.org/abs/1611.09268 MS MARCO: A human generated machine reading comprehension dataset. ArXiv preprint, abs/1611.09268

  3. [3]

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://doi.org/10.18653/v1/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Comp...

  4. [5]

    DataCanary, hilfialkaff, Lili Jiang, Meg Risdal, Nikhil Dandekar, and tomtung. 2017. https://kaggle.com/competitions/quora-question-pairs Quora question pairs

  5. [7]

    Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic bert sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878--891

  6. [9]

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. https://arxiv.org/abs/2112.09118 Towards unsupervised dense information retrieval with contrastive learning . ArXiv preprint, abs/2112.09118

  7. [11]

    Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel S Weld. 2020. S2orc: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969--4983

  8. [12]

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. https://aclanthology.org/2023.eacl-main.148 MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014--2037, Dubrovnik, Croatia. Association for Computational Linguistics

  9. [15]

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. 2022b. https://aclanthology.org/2022.emnlp-main.669 Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9844--9855, A...

  10. [16]

    OpenAI. 2023. https://arxiv.org/abs/2303.08774 GPT-4 technical report. ArXiv preprint, abs/2303.08774

  11. [17]

    Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, QiaoQiao She, Jing Liu, Hua Wu, and Haifeng Wang. 2022. https://aclanthology.org/2022.emnlp-main.357 DuReader-retrieval: A large-scale Chinese benchmark for passage retrieval from web search engine. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5326--5338, A...

  12. [20]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. https://arxiv.org/abs/2212.03533 Text embeddings by weakly-supervised contrastive pre-training . ArXiv preprint, abs/2212.03533

  13. [22]

    Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. 2021. Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2140--2151

  14. [23]

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. https://arxiv.org/abs/2309.07597 C-pack: Packaged resources to advance general Chinese embedding. ArXiv preprint, abs/2309.07597

  15. [24]

    Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483--498

  16. [27]

    Xinyu Crystina Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2023. Miracl: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics, 11:1114--1131

  17. [28]

    Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2018. Overview of the third bucc shared task: Spotting parallel sentences in comparable corpora. In Proceedings of 11th Workshop on Building and Using Comparable Corpora, pages 39--42

  18. [29]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2020.emnlp-main.550

  19. [31]

    Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas A. Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David P. Schnurr, Felipe Petroski Such, Kenny Sai-Kin H... 2022. Text and code embeddings by contrastive pre-training. ArXiv preprint, abs/2201.10005

  20. [33]

    Nandan Thakur, Nils Reimers, et al. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  21. [34]

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. doi:10.18653/v1/2021.emnlp-main.552

  22. [36]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. doi:10.18653/v1/D19-1410

  23. [37]

    Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang. 2022. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. doi:10.18653/v1/2022.findings-acl.146

  24. [39]

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and VERification. doi:10.18653/v1/N18-1074

  25. [40]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. doi:10.18653/v1/D18-1259

  26. [46]

    Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021. Mr. TyDi: A multi-lingual benchmark for dense retrieval. doi:10.18653/v1/2021.mrl-1.12

  27. [48]

    Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. doi:10.18653/v1/P19-1346

  28. [50]

    Improving text embeddings with large language models. ArXiv preprint, arXiv:2401.00368

  29. [54]

    No Language Left Behind: Scaling human-centered machine translation. ArXiv preprint, arXiv:2207.04672

  30. [56]

    Crosslingual generalization through multitask finetuning. ArXiv preprint, arXiv:2211.01786