Multilingual E5 Text Embeddings: A Technical Report
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 19:12 UTC · model grok-4.3
The pith
Multilingual E5 embeddings match English state-of-the-art performance using the same training recipe
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The training methodology of the English E5 model transfers directly to multilingual data: contrastive pre-training on 1 billion multilingual text pairs followed by fine-tuning on a combination of labeled datasets. This recipe yields three embedding models of different sizes with competitive performance across languages, plus an instruction-tuned model that reaches parity with state-of-the-art English-only models of similar size.
What carries the argument
Contrastive pre-training on 1 billion multilingual text pairs, which builds general cross-lingual representations before supervised fine-tuning adapts them for downstream use.
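To make the mechanism concrete, here is a minimal sketch of the in-batch-negative contrastive (InfoNCE-style) objective that this kind of pre-training typically optimizes. The pooling, temperature, and batch size shown are illustrative assumptions, not the report's exact configuration.

```python
# Minimal sketch of an InfoNCE-style contrastive loss over (query, passage)
# text pairs with in-batch negatives. Values here are placeholders standing
# in for encoder outputs; temperature and dimensions are assumptions.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """query_emb, passage_emb: [batch, dim] L2-normalized embeddings where
    row i of each tensor comes from the same text pair."""
    # Cosine similarity matrix: entry (i, j) compares query i with passage j.
    logits = query_emb @ passage_emb.T / temperature
    # The matching passage for query i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random "embeddings" in place of real encoder outputs.
q = F.normalize(torch.randn(8, 768), dim=-1)
p = F.normalize(torch.randn(8, 768), dim=-1)
print(info_nce_loss(q, p).item())
```

The same loss shape applies to both the pre-training and fine-tuning stages; what changes is the data (weakly paired web text versus labeled datasets) and the use of additional hard negatives.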
Load-bearing premise
The assumption that the training methodology from the English E5 model can be directly applied to multilingual data to achieve comparable performance without language-specific adjustments or biases.
What would settle it
A large-scale evaluation on diverse multilingual benchmarks where the new models show substantially lower performance than English-only counterparts on cross-lingual tasks, or where language-specific fine-tuning proves necessary for competitive results, would disprove the direct-transfer claim.
Original abstract
This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between the inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5 .
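For readers who want to try the released checkpoints, a minimal usage sketch follows. It assumes the publicly released Hugging Face model id intfloat/multilingual-e5-small and the "query: " / "passage: " prefix convention documented with the release; neither detail appears in the abstract itself, so treat both as assumptions.

```python
# Hedged usage sketch for a released multilingual E5 checkpoint.
# Model id and prefix convention are taken from the public release notes,
# not from the abstract above.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

queries = ["query: how do multilingual embeddings work"]
passages = [
    "passage: Multilingual E5 is pre-trained on about 1 billion multilingual text pairs.",
    "passage: Les plongements multilingues alignent les langues dans un même espace vectoriel.",
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(q_emb @ p_emb.T)  # cosine similarities, since embeddings are normalized
```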
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This technical report describes the training of open-source multilingual E5 text embedding models in three sizes (small, base, large). The procedure follows the English E5 recipe exactly: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on labeled datasets. It additionally introduces an instruction-tuned variant whose performance is claimed to be on par with English-only SOTA models of similar size. Models are released at the provided GitHub link.
Significance. If the performance claims are substantiated, the work delivers practical open-source multilingual embeddings that could reduce reliance on English-centric models for cross-lingual tasks while maintaining competitive quality. The public release of models and the direct reuse of a proven training recipe support reproducibility and adoption in the community.
major comments (2)
- [Training Methodology and Evaluation sections] The central performance-parity claim for the instruction-tuned model rests on direct transfer of the English E5 recipe to 1B multilingual pairs, yet no ablation isolates the multilingual pre-training effect (e.g., versus an English-only baseline trained on equivalent volume). This omission leaves open the possibility of negative transfer or data-imbalance effects and is load-bearing for the 'on par with SOTA' assertion.
- [Evaluation results] Results tables present aggregate scores but lack per-language breakdowns or language-balance statistics for the 1B-pair corpus. Without these, it is impossible to verify that low-resource languages do not degrade overall performance or that the parity claim holds uniformly.
minor comments (2)
- [Model description] Model sizes (parameter counts) for 'small / base / large' are stated but not tabulated with exact figures; add a table row for clarity.
- [Abstract] The abstract's performance claim would benefit from naming the specific English SOTA baselines and metrics (e.g., MTEB scores) rather than the generic phrase 'on par with state-of-the-art'.
Simulated Author's Rebuttal
We appreciate the referee's thorough review of our technical report on Multilingual E5 Text Embeddings. We address each major comment in detail below and outline the changes we plan to make in the revised manuscript.
Point-by-point responses
-
Referee: The central performance-parity claim for the instruction-tuned model rests on direct transfer of the English E5 recipe to 1B multilingual pairs, yet no ablation isolates the multilingual pre-training effect (e.g., versus an English-only baseline trained on equivalent volume). This omission leaves open the possibility of negative transfer or data-imbalance effects and is load-bearing for the 'on par with SOTA' assertion.
Authors: We acknowledge that an ablation study comparing the multilingual pre-training to an English-only baseline on the same data volume would provide valuable insights into potential negative transfer or data imbalance effects. However, training an additional model on 1 billion pairs requires significant computational resources that were not available for this technical report. The manuscript demonstrates that applying the English E5 recipe to multilingual data yields models whose performance is on par with English SOTA models on relevant benchmarks. This outcome indicates that any negative transfer effects are not substantial enough to prevent achieving competitive results. In the revised manuscript, we will expand the discussion section to explicitly address this point and suggest it as an avenue for future research.
Revision: partial
-
Referee: Results tables present aggregate scores but lack per-language breakdowns or language-balance statistics for the 1B-pair corpus. Without these, it is impossible to verify that low-resource languages do not degrade overall performance or that the parity claim holds uniformly.
Authors: We agree with the referee that per-language breakdowns and language-balance statistics would improve the transparency of our results. The corpus of 1 billion text pairs was curated to include a diverse set of languages, with efforts to balance representation where data availability permitted. In the revised version of the manuscript, we will add language distribution statistics for the pre-training corpus and include per-language performance metrics for the evaluation datasets where such breakdowns are feasible and meaningful.
Revision: yes
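As a purely illustrative aside, the per-language breakdown the referee requests could be aggregated as below, assuming per-(language, task) scores have already been collected; the column names and numbers are hypothetical, not the report's results.

```python
# Hypothetical aggregation of per-language scores into the breakdown the
# referee asks for. Data and column names are illustrative only.
import pandas as pd

rows = [
    {"language": "en", "task": "retrieval", "score": 0.52},
    {"language": "de", "task": "retrieval", "score": 0.47},
    {"language": "sw", "task": "retrieval", "score": 0.39},
    {"language": "en", "task": "classification", "score": 0.71},
    {"language": "de", "task": "classification", "score": 0.68},
    {"language": "sw", "task": "classification", "score": 0.60},
]
df = pd.DataFrame(rows)

# Mean score per language, plus the gap to the English average, so that
# low-resource degradation is visible rather than hidden in an aggregate.
per_lang = df.groupby("language")["score"].mean()
print(per_lang)
print(per_lang - per_lang["en"])
```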
Circularity Check
No circularity: empirical training report with no derivations or self-referential reductions.
Full rationale
The manuscript is a technical report describing contrastive pre-training on 1B multilingual pairs followed by fine-tuning, explicitly adhering to the prior English E5 recipe and reporting downstream evaluation scores. No equations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear. The performance parity claim is an empirical observation, not a derivation that reduces to its inputs by construction. Self-citation to the English E5 work is present but not load-bearing for any mathematical result; the central content remains independent training and evaluation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Cost.FunctionalEquation · washburn_uniqueness_aczel (tag: unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets."
-
Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi (tag: unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 34 Pith papers
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
-
Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
-
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
-
RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora
RARE builds redundancy-aware benchmarks via atomic fact decomposition and CRRF-enhanced LLM generation, showing retriever PerfRecall@10 dropping from 66.4% on general data to 5.0-27.9% on high-similarity finance/legal...
-
Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers
Code-switching creates a fundamental performance bottleneck for multilingual retrievers, causing drops of up to 27% on new benchmarks CSR-L and CS-MTEB, with embedding divergence as the key cause and vocabulary expans...
-
Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering
Claim2Vec is a contrastively fine-tuned multilingual encoder that improves claim clustering performance and embedding space structure on multilingual fact-check datasets.
-
LMEB: Long-horizon Memory Embedding Benchmark
LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
-
An Annotation Scheme and Classifier for Personal Facts in Dialogue
An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 ...
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...
-
MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal
MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide di...
-
Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding
Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.
-
Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization
Iterative LLM-based refinement of category definitions improves zero-shot classification performance across 13 embedding models on a new 10-category web URL benchmark.
-
JFinTEB: Japanese Financial Text Embedding Benchmark
JFinTEB is the first benchmark for evaluating Japanese financial text embeddings across retrieval and classification tasks derived from realistic financial scenarios.
-
HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval
HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...
-
VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...
-
Learning to Retrieve from Agent Trajectories
Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.
-
QOuLiPo: What a quantum computer sees when it reads a book
Literary texts are turned into graphs for neutral-atom quantum processors, with a new rigidity metric distinguishing structural uniqueness and a QOuLiPo corpus of engineered texts created to match hardware-native graphs.
-
GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass inference with modular flexibility.
-
GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.
-
CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation
CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.
-
AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition
AffectAgent deploys a query planner, evidence filter, and emotion generator as collaborative agents trained via MAPPO with shared reward, plus MB-MoE and RAAF modules, to achieve superior multimodal emotion recognitio...
-
Human-Inspired Context-Selective Multimodal Memory for Social Robots
A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.
-
Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG
Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.
-
On the Representational Limits of Quantum-Inspired 1024-D Document Embeddings: An Experimental Evaluation Framework
Quantum-inspired 1024-D document embeddings exhibit weak, unstable ranking performance and structural geometric limitations, performing better as auxiliary components in hybrid lexical-embedding retrieval systems.
-
Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition
ADAM uses personality-guided LLM augmentation and cross-lingual attention distillation to raise balanced accuracy on multilingual personality recognition to 0.6332 on Essays and 0.7448 on Kaggle, outperforming standar...
-
From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning
Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
-
Granite Embedding Multilingual R2 Models
Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.
-
Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation
A distillation technique embeds LLM-generated textual user profiles into efficient sequential recommenders without runtime LLM inference, architectural changes, or fine-tuning.
-
Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance ...
-
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.
-
Continual Learning with Multilingual Foundation Model
Framework using XLM-RoBERTa, back-translation augmentation, and language-specific thresholds detects reclaimed slurs with 2-5% F1 score gains.
-
HR-Agents: Using Multiple LLM-based Agents to Improve Q&A about Brazilian Labor Legislation
A multi-agent LLM system using CrewAI and RAG improves response coherence and correctness over a single-LLM RAG baseline for Brazilian labor law Q&A.
Reference graph
Works this paper leans on
-
[1]
Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597--610
-
[2]
Daniel Fernando Campos, Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, Li Deng, and Bhaskar Mitra. 2016. https://arxiv.org/abs/1611.09268 Ms marco: A human generated machine reading comprehension dataset . ArXiv preprint, abs/1611.09268
-
[3]
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://doi.org/10.18653/v1/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Comp...
-
[5]
DataCanary, hilfialkaff, Lili Jiang, Meg Risdal, Nikhil Dandekar, and tomtung. 2017. https://kaggle.com/competitions/quora-question-pairs Quora question pairs
-
[7]
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic bert sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878--891
-
[9]
Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. https://arxiv.org/abs/2112.09118 Towards unsupervised dense information retrieval with contrastive learning . ArXiv preprint, abs/2112.09118
-
[11]
Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel S Weld. 2020. S2orc: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969--4983
-
[12]
Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. https://aclanthology.org/2023.eacl-main.148 MTEB : Massive text embedding benchmark . In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014--2037, Dubrovnik, Croatia. Association for Computational Linguistics
-
[15]
Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. 2022b. https://aclanthology.org/2022.emnlp-main.669 Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9844--9855, A...
-
[16]
OpenAI. 2023. https://arxiv.org/abs/2303.08774 Gpt-4 technical report . ArXiv preprint, abs/2303.08774
-
[17]
Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, QiaoQiao She, Jing Liu, Hua Wu, and Haifeng Wang. 2022. https://aclanthology.org/2022.emnlp-main.357 DuReader-retrieval: A large-scale Chinese benchmark for passage retrieval from web search engine. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5326--5338, A...
-
[20]
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. https://arxiv.org/abs/2212.03533 Text embeddings by weakly-supervised contrastive pre-training . ArXiv preprint, abs/2212.03533
-
[22]
Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. 2021. Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2140--2151
-
[23]
Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighof. 2023. https://arxiv.org/abs/2309.07597 C-pack: Packaged resources to advance general chinese embedding . ArXiv preprint, abs/2309.07597
-
[24]
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mt5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483--498
-
[27]
Xinyu Crystina Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2023. Miracl: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics, 11:1114--1131
-
[28]
Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2018. Overview of the third bucc shared task: Spotting parallel sentences in comparable corpora. In Proceedings of 11th Workshop on Building and Using Comparable Corpora, pages 39--42
-
[29]
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2020.emnlp-main.550
-
[30]
Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.
-
[31]
Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas A. Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David P. Schnurr, Felipe Petroski Such, et al. 2022. Text and code embeddings by contrastive pre-training. ArXiv preprint, abs/2201.10005
-
[32]
Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Towards unsupervised dense information retrieval with contrastive learning. ArXiv preprint, abs/2112.09118
-
[33]
Nandan Thakur, Nils Reimers, et al. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
-
[34]
Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2021.emnlp-main.552
-
[35]
Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. 2022. Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
-
[36]
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/D19-1410
-
[37]
Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang. 2022. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the Association for Computational Linguistics: ACL 2022. doi:10.18653/v1/2022.findings-acl.146
-
[38]
Daniel Fernando Campos, Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, Li Deng, and Bhaskar Mitra. 2016. MS MARCO: A human generated machine reading comprehension dataset. ArXiv preprint, abs/1611.09268
-
[39]
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. doi:10.18653/v1/N18-1074
-
[40]
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/D18-1259
-
[41]
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
- [42]
-
[43]
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. ArXiv preprint, abs/2212.03533
-
[44]
Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighof. 2023. C-pack: Packaged resources to advance general chinese embedding. ArXiv preprint, abs/2309.07597
-
[45]
Xinyu Crystina Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2023. MIRACL: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics, 11:1114--1131
-
[46]
Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021. Mr. TyDi: A multi-lingual benchmark for dense retrieval. In Proceedings of the 1st Workshop on Multilingual Representation Learning. doi:10.18653/v1/2021.mrl-1.12
-
[47]
Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, QiaoQiao She, Jing Liu, Hua Wu, and Haifeng Wang. 2022. DuReader-retrieval: A large-scale Chinese benchmark for passage retrieval from web search engine. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
-
[48]
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/P19-1346
-
[49]
DataCanary, hilfialkaff, Lili Jiang, Meg Risdal, Nikhil Dandekar, and tomtung. 2017. Quora question pairs.
-
[50]
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Improving text embeddings with large language models. ArXiv preprint, abs/2401.00368
-
[51]
Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2018. Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the 11th Workshop on Building and Using Comparable Corpora, pages 39--42
-
[52]
Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597--610
-
[53]
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483--498
-
[54]
NLLB Team et al. 2022. No language left behind: Scaling human-centered machine translation. ArXiv preprint, abs/2207.04672
-
[55]
Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel S. Weld. 2020. S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969--4983
-
[56]
Niklas Muennighoff et al. 2022. Crosslingual generalization through multitask finetuning. ArXiv preprint, abs/2211.01786
-
[57]
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878--891
-
[58]
Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. 2021. MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2140--2151
discussion (0)