pith. machine review for the scientific record.

arxiv: 2402.05672 · v1 · submitted 2024-02-08 · 💻 cs.CL · cs.IR

Recognition: 2 Lean theorem links

Multilingual E5 Text Embeddings: A Technical Report

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 19:12 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords multilingual embeddings · text embeddings · contrastive learning · instruction tuning · E5 models · cross-lingual retrieval · semantic search · natural language processing

The pith

Multilingual E5 embeddings match English state-of-the-art performance using the same training recipe

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper details how to create text embedding models that function well across many languages by following the established English E5 training steps without major alterations. It begins with contrastive pre-training on one billion multilingual text pairs to build broad representations, then applies fine-tuning on labeled datasets for task adaptation, and includes a new instruction-tuned variant. This matters for practical use because reliable multilingual embeddings improve semantic search, retrieval, and understanding in non-English contexts where high-quality options have been scarce. The models are released in small, base, and large sizes to let users trade off speed against accuracy. The standout result is that the instruction-tuned multilingual version performs at the level of leading English-only models of similar size.

Core claim

The training methodology from the English E5 model transfers directly to multilingual data: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets, yields three embedding models of different sizes that are competitive across languages, plus an instruction-tuned model that reaches parity with state-of-the-art English-only models of similar size.

What carries the argument

Contrastive pre-training on 1 billion multilingual text pairs, which builds general cross-lingual representations before supervised fine-tuning adapts them for downstream use.
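The objective behind this pre-training stage is the standard InfoNCE loss with in-batch negatives. A minimal numpy sketch of the loss arithmetic; the temperature and the toy batch are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: each query's positive is the passage
    at the same index; every other passage in the batch serves as a negative."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    logits = q @ p.T / temperature                 # (batch, batch) cosine sims
    logits -= logits.max(axis=1, keepdims=True)    # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # cross-entropy on the diagonal

# toy batch of four orthogonal pairs: aligned positives give a near-zero loss,
# shuffled positives give a large one
batch = np.eye(4, 8)
aligned = info_nce_loss(batch, batch)
misaligned = info_nce_loss(batch, np.roll(batch, 1, axis=0))
print(aligned, misaligned)
```

Minimizing this loss pulls each pair together and pushes the rest of the batch apart, which is how the 1B pairs shape the cross-lingual embedding space before any supervised labels are seen.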

Load-bearing premise

The assumption that the training methodology from the English E5 model can be directly applied to multilingual data to achieve comparable performance without language-specific adjustments or biases.

What would settle it

A large-scale evaluation on diverse multilingual benchmarks where the new models show substantially lower performance than English-only counterparts on cross-lingual tasks, or where language-specific fine-tuning proves necessary for competitive results, would disprove the direct-transfer claim.

read the original abstract

This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between the inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5 .
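At inference, E5-family models are typically used by embedding prefixed inputs ("query: …" / "passage: …"), mean-pooling the encoder's token states, and ranking passages by cosine similarity. A hedged sketch of just the pooling-and-ranking arithmetic, with stand-in token states rather than the released encoder:

```python
import numpy as np

def mean_pool(token_states, attention_mask):
    """Average token states over real (non-padding) positions."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_states * mask).sum(axis=1)
    counts = mask.sum(axis=1).clip(min=1.0)
    return summed / counts

def rank_passages(query_vec, passage_vecs):
    """Cosine-similarity ranking after L2 normalization."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q
    return np.argsort(-scores), scores

# stand-in token states for one "query: ..." input (2 real tokens + 1 pad)
states = np.array([[[1.0, 0, 0, 0], [1.0, 0, 0, 0], [5.0, 5, 5, 5]]])
mask = np.array([[1, 1, 0]])                     # padding position is masked out
query = mean_pool(states, mask)[0]               # -> [1, 0, 0, 0]
passages = np.array([[0.9, 0.1, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
order, scores = rank_passages(query, passages)
print(order)
```

The attention mask matters: without it, the padding row would drag the pooled vector off-axis and corrupt the ranking.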

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This technical report describes the training of open-source multilingual E5 text embedding models in three sizes (small, base, large). The procedure follows the English E5 recipe exactly: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on labeled datasets. It additionally introduces an instruction-tuned variant whose performance is claimed to be on par with English-only SOTA models of similar size. Models are released at the provided GitHub link.

Significance. If the performance claims are substantiated, the work delivers practical open-source multilingual embeddings that could reduce reliance on English-centric models for cross-lingual tasks while maintaining competitive quality. The public release of models and the direct reuse of a proven training recipe support reproducibility and adoption in the community.

major comments (2)
  1. [Training Methodology and Evaluation sections] The central performance-parity claim for the instruction-tuned model rests on direct transfer of the English E5 recipe to 1B multilingual pairs, yet no ablation isolates the multilingual pre-training effect (e.g., versus an English-only baseline trained on equivalent volume). This omission leaves open the possibility of negative transfer or data-imbalance effects and is load-bearing for the 'on par with SOTA' assertion.
  2. [Evaluation results] Results tables present aggregate scores but lack per-language breakdowns or language-balance statistics for the 1B-pair corpus. Without these, it is impossible to verify that low-resource languages do not degrade overall performance or that the parity claim holds uniformly.
minor comments (2)
  1. [Model description] Model sizes (parameter counts) for 'small / base / large' are stated but not tabulated with exact figures; add a table row for clarity.
  2. [Abstract] The abstract's performance claim would benefit from naming the specific English SOTA baselines and metrics (e.g., MTEB scores) rather than the generic phrase 'on par with state-of-the-art'.
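The referee's aggregation concern can be made concrete with a toy calculation: a score weighted by per-language example counts (micro average) can mask low-resource degradation that a uniform macro average exposes. All numbers below are hypothetical:

```python
# Hypothetical per-language retrieval scores and example counts
scores = {"en": 0.80, "de": 0.78, "sw": 0.40}   # low-resource stand-in: Swahili
counts = {"en": 9000, "de": 900, "sw": 100}

micro = sum(scores[l] * counts[l] for l in scores) / sum(counts.values())
macro = sum(scores.values()) / len(scores)

print(f"micro (count-weighted): {micro:.3f}")   # dominated by English
print(f"macro (uniform): {macro:.3f}")          # exposes the weak language
```

A 0.13-point gap between the two averages on the same results is exactly the kind of discrepancy a per-language breakdown would surface.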

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough review of our technical report on Multilingual E5 Text Embeddings. We address each major comment in detail below and outline the changes we plan to make in the revised manuscript.

read point-by-point responses
  1. Referee: The central performance-parity claim for the instruction-tuned model rests on direct transfer of the English E5 recipe to 1B multilingual pairs, yet no ablation isolates the multilingual pre-training effect (e.g., versus an English-only baseline trained on equivalent volume). This omission leaves open the possibility of negative transfer or data-imbalance effects and is load-bearing for the 'on par with SOTA' assertion.

    Authors: We acknowledge that an ablation study comparing the multilingual pre-training to an English-only baseline on the same data volume would provide valuable insights into potential negative transfer or data imbalance effects. However, training an additional model on 1 billion pairs requires significant computational resources that were not available for this technical report. The manuscript demonstrates that applying the English E5 recipe to multilingual data yields models whose performance is on par with English SOTA models on relevant benchmarks. This outcome indicates that any negative transfer effects are not substantial enough to prevent achieving competitive results. In the revised manuscript, we will expand the discussion section to explicitly address this point and suggest it as an avenue for future research. revision: partial

  2. Referee: Results tables present aggregate scores but lack per-language breakdowns or language-balance statistics for the 1B-pair corpus. Without these, it is impossible to verify that low-resource languages do not degrade overall performance or that the parity claim holds uniformly.

    Authors: We agree with the referee that per-language breakdowns and language-balance statistics would improve the transparency of our results. The 1 billion text pairs corpus was curated to include a diverse set of languages, with efforts to balance representation where data availability permitted. In the revised version of the manuscript, we will add language distribution statistics for the pre-training corpus and include per-language performance metrics for the evaluation datasets where such breakdowns are feasible and meaningful. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training report with no derivations or self-referential reductions.

full rationale

The manuscript is a technical report describing contrastive pre-training on 1B multilingual pairs followed by fine-tuning, explicitly adhering to the prior English E5 recipe and reporting downstream evaluation scores. No equations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear. The performance parity claim is an empirical observation, not a derivation that reduces to its inputs by construction. Self-citation to the English E5 work is present but not load-bearing for any mathematical result; the central content remains independent training and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an empirical technical report on model training, there are no explicit free parameters, mathematical axioms, or newly invented entities described in the abstract. The work relies on standard machine learning practices and datasets.

pith-pipeline@v0.9.0 · 5422 in / 990 out tokens · 57506 ms · 2026-05-12T19:12:33.165163+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 7.0

    Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.

  2. How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...

  3. Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders

    cs.CL 2026-05 unverdicted novelty 7.0

    EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.

  4. ATIR: Towards Audio-Text Interleaved Contextual Retrieval

    cs.SD 2026-04 unverdicted novelty 7.0

    Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

  5. RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora

    cs.CL 2026-04 unverdicted novelty 7.0

    RARE builds redundancy-aware benchmarks via atomic fact decomposition and CRRF-enhanced LLM generation, showing retriever PerfRecall@10 dropping from 66.4% on general data to 5.0-27.9% on high-similarity finance/legal...

  6. Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

    cs.IR 2026-04 unverdicted novelty 7.0

    Code-switching creates a fundamental performance bottleneck for multilingual retrievers, causing drops of up to 27% on new benchmarks CSR-L and CS-MTEB, with embedding divergence as the key cause and vocabulary expans...

  7. Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering

    cs.CL 2026-04 unverdicted novelty 7.0

    Claim2Vec is a contrastively fine-tuned multilingual encoder that improves claim clustering performance and embedding space structure on multilingual fact-check datasets.

  8. LMEB: Long-horizon Memory Embedding Benchmark

    cs.CL 2026-03 unverdicted novelty 7.0

    LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.

  9. An Annotation Scheme and Classifier for Personal Facts in Dialogue

    cs.CL 2026-05 accept novelty 6.0

    An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 ...

  10. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 6.0

    GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...

  11. MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal

    cs.IR 2026-05 unverdicted novelty 6.0

    MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide di...

  12. Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.

  13. Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    Iterative LLM-based refinement of category definitions improves zero-shot classification performance across 13 embedding models on a new 10-category web URL benchmark.

  14. JFinTEB: Japanese Financial Text Embedding Benchmark

    cs.IR 2026-04 unverdicted novelty 6.0

    JFinTEB is the first benchmark for evaluating Japanese financial text embeddings across retrieval and classification tasks derived from realistic financial scenarios.

  15. HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...

  16. VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation

    cs.CV 2026-04 conditional novelty 6.0

    VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...

  17. Learning to Retrieve from Agent Trajectories

    cs.IR 2026-03 conditional novelty 6.0

    Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.

  18. QOuLiPo: What a quantum computer sees when it reads a book

    quant-ph 2026-05 unverdicted novelty 5.0

    Literary texts are turned into graphs for neutral-atom quantum processors, with a new rigidity metric distinguishing structural uniqueness and a QOuLiPo corpus of engineered texts created to match hardware-native graphs.

  19. GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

    cs.CL 2026-05 unverdicted novelty 5.0

    GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass inference with modular flexibility.


  21. CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 5.0

    CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.

  22. AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

    cs.CV 2026-04 unverdicted novelty 5.0

    AffectAgent deploys a query planner, evidence filter, and emotion generator as collaborative agents trained via MAPPO with shared reward, plus MB-MoE and RAAF modules, to achieve superior multimodal emotion recognitio...

  23. Human-Inspired Context-Selective Multimodal Memory for Social Robots

    cs.AI 2026-04 unverdicted novelty 5.0

    A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.

  24. Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

    cs.CL 2026-04 unverdicted novelty 5.0

    Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.

  25. On the Representational Limits of Quantum-Inspired 1024-D Document Embeddings: An Experimental Evaluation Framework

    cs.IR 2026-04 unverdicted novelty 5.0

    Quantum-inspired 1024-D document embeddings exhibit weak, unstable ranking performance and structural geometric limitations, performing better as auxiliary components in hybrid lexical-embedding retrieval systems.

  26. Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition

    cs.CL 2026-04 unverdicted novelty 5.0

    ADAM uses personality-guided LLM augmentation and cross-lingual attention distillation to raise balanced accuracy on multilingual personality recognition to 0.6332 on Essays and 0.7448 on Kaggle, outperforming standar...

  27. From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning

    q-bio.QM 2026-04 unverdicted novelty 5.0

    Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.

  28. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  29. Granite Embedding Multilingual R2 Models

    cs.IR 2026-05 unverdicted novelty 4.0

    Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.

  30. Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation

    cs.IR 2026-04 unverdicted novelty 4.0

    A distillation technique embeds LLM-generated textual user profiles into efficient sequential recommenders without runtime LLM inference, architectural changes, or fine-tuning.

  31. Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task

    cs.CL 2026-04 unverdicted novelty 4.0

    Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance ...

  32. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    cs.CL 2026-01 unverdicted novelty 4.0

    Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.

  33. Continual Learning with Multilingual Foundation Model

    cs.CL 2026-05 unverdicted novelty 3.0

    Framework using XLM-RoBERTa, back-translation augmentation, and language-specific thresholds detects reclaimed slurs with 2-5% F1 score gains.

  34. HR-Agents: Using Multiple LLM-based Agents to Improve Q&A about Brazilian Labor Legislation

    cs.IR 2026-03 unverdicted novelty 3.0

    A multi-agent LLM system using CrewAI and RAG improves response coherence and correctness over a single-LLM RAG baseline for Brazilian labor law Q&A.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 32 Pith papers · 7 internal anchors

  1. [1]

    Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597--610

  2. [2]

    Daniel Fernando Campos, Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, Li Deng, and Bhaskar Mitra. 2016. https://arxiv.org/abs/1611.09268 MS MARCO: A human generated machine reading comprehension dataset. ArXiv preprint, abs/1611.09268

  3. [3]

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://doi.org/10.18653/v1/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Comp...

  4. [5]

    DataCanary, hilfialkaff, Lili Jiang, Meg Risdal, Nikhil Dandekar, and tomtung. 2017. https://kaggle.com/competitions/quora-question-pairs Quora question pairs

  5. [7]

    Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic bert sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878--891

  6. [9]

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. https://arxiv.org/abs/2112.09118 Towards unsupervised dense information retrieval with contrastive learning . ArXiv preprint, abs/2112.09118

  7. [11]

    Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel S Weld. 2020. S2orc: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969--4983

  8. [12]

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. https://aclanthology.org/2023.eacl-main.148 MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014--2037, Dubrovnik, Croatia. Association for Computational Linguistics

  9. [15]

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. 2022b. https://aclanthology.org/2022.emnlp-main.669 Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9844--9855, A...

  10. [16]

    OpenAI. 2023. https://arxiv.org/abs/2303.08774 GPT-4 technical report. ArXiv preprint, abs/2303.08774

  11. [17]

    Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, QiaoQiao She, Jing Liu, Hua Wu, and Haifeng Wang. 2022. https://aclanthology.org/2022.emnlp-main.357 DuReader-retrieval: A large-scale Chinese benchmark for passage retrieval from web search engine. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5326--5338, A...

  12. [20]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. https://arxiv.org/abs/2212.03533 Text embeddings by weakly-supervised contrastive pre-training . ArXiv preprint, abs/2212.03533

  13. [22]

    Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. 2021. Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2140--2151

  14. [23]

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. https://arxiv.org/abs/2309.07597 C-pack: Packaged resources to advance general Chinese embedding. ArXiv preprint, abs/2309.07597

  15. [24]

    Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483--498

  16. [27]

    Xinyu Crystina Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2023. Miracl: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics, 11:1114--1131

  17. [28]

    Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2018. Overview of the third bucc shared task: Spotting parallel sentences in comparable corpora. In Proceedings of 11th Workshop on Building and Using Comparable Corpora, pages 39--42

  18. [29]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2020.emnlp-main.550

  19. [31]

    Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas A. Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David P. Schnurr, Felipe Petroski Such, Kenny Sai-Kin H... 2022. Text and code embeddings by contrastive pre-training. ArXiv preprint, abs/2201.10005

  20. [33]

    Nandan Thakur, Nils Reimers, et al. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  21. [34]

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. doi:10.18653/v1/2021.emnlp-main.552

  22. [36]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. doi:10.18653/v1/D19-1410

  23. [37]

    Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang. 2022. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. doi:10.18653/v1/2022.findings-acl.146

  24. [39]

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and VERification. doi:10.18653/v1/N18-1074

  25. [40]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. doi:10.18653/v1/D18-1259

  26. [46]

    Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021. Mr. TyDi: A multi-lingual benchmark for dense retrieval. doi:10.18653/v1/2021.mrl-1.12

  27. [48]

    Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. doi:10.18653/v1/P19-1346

  28. [50]

    Improving text embeddings with large language models. ArXiv preprint, arXiv:2401.00368

  29. [54]

    No Language Left Behind: Scaling human-centered machine translation. ArXiv preprint, arXiv:2207.04672

  30. [56]

    Crosslingual generalization through multitask finetuning. ArXiv preprint, arXiv:2211.01786