Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems
Pith reviewed 2026-05-25 04:22 UTC · model grok-4.3
The pith
Google Embeddings 2 tops multilingual retrieval scores yet runs roughly fourteen times slower than the fastest open-source rivals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GE2 ranks first on every task, achieving BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but at 231.6 ms median latency it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. LaBSE scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model. All six models saturate at 32-token chunks, with semantic chunking providing measurable gains only at 16 tokens.
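The latency figures in the core claim can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative. It suggests that the 14x figure implies a fastest local model at about 16.5 ms, while the gap to mE5-L specifically is closer to 7.5x.

```python
# Figures quoted in the core claim above.
GE2_MEDIAN_MS = 231.6   # GE2 median latency
ME5L_MS = 31.0          # mE5-L latency quoted for the Italian comparison
SLOWDOWN = 14.0         # reported slowdown versus the fastest local models

# Latency the fastest local model would need for the 14x figure to hold.
implied_fastest_ms = GE2_MEDIAN_MS / SLOWDOWN

# Slowdown of GE2 relative to mE5-L specifically.
ratio_vs_me5l = GE2_MEDIAN_MS / ME5L_MS

print(f"implied fastest local latency: {implied_fastest_ms:.1f} ms")
print(f"GE2 / mE5-L latency ratio: {ratio_vs_me5l:.1f}x")
```

Both 16.5 ms and 31 ms sit well under a 100 ms budget, so the SLA argument does not depend on which local model is taken as the reference point.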
What carries the argument
Side-by-side nDCG@10 evaluation on BEIR subsets and the Italian RAG corpus, plus CPU latency profiling and chunk-size ablations across five token lengths and three strategies.
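For readers unfamiliar with the metric, here is a minimal sketch of nDCG@10 for a single query. It assumes linear gain and a log2 rank discount; some toolkits use an exponential gain instead, and the source does not say which variant the paper uses. The function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float],
              all_relevances: Sequence[float],
              k: int = 10) -> float:
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: relevance labels in the order the system returned them.
    all_relevances: relevance labels of every judged document for the query,
                    including relevant documents the system failed to retrieve.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Toy query: three relevant documents exist, two are retrieved at ranks 1 and 3.
example_score = ndcg_at_k([1, 0, 1, 0], [1, 1, 1], k=10)
print(f"nDCG@10 = {example_score:.4f}")
```

A benchmark-level score such as the BEIR average is then the mean of this per-query value over all queries, averaged again across subsets.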
If this is right
- GE2 is the accuracy leader when latency constraints are absent.
- mE5-L supplies near-equivalent Italian retrieval quality under strict latency limits.
- LaBSE should be replaced by dedicated retrieval models such as mMPNet in new deployments.
- Chunk sizes larger than 32 tokens bring no further gains on the tested corpus.
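To make the chunk-size finding concrete, the following is a rough sketch of fixed-size chunking at the two sizes named above (16 and 32 tokens). It assumes whitespace tokenization, which will not match any model's real tokenizer, and it does not reproduce the paper's three strategies, which the source does not describe in detail. `chunk_tokens` and its parameters are hypothetical names.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int, overlap: int = 0) -> List[List[str]]:
    """Split a token list into windows of `size` tokens with optional overlap."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        # Stop once the current window already reaches the end of the text.
        if start + size >= len(tokens):
            break
        start += step
    return chunks


# Assumption: whitespace tokens stand in for model tokens.
document = " ".join(f"w{i}" for i in range(40)).split()
for chunk_size in (16, 32):
    lengths = [len(c) for c in chunk_tokens(document, chunk_size)]
    print(chunk_size, lengths)
```

In an ablation of this kind, each chunk list would be embedded and indexed separately, and retrieval quality compared across sizes.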
Where Pith is reading between the lines
- RAG pipelines with real-time requirements will likely adopt open-source models to stay under 100 ms without meaningful accuracy loss.
- The latency gap suggests hosted models may need GPU acceleration or caching to compete with local inference in production.
- Extending the benchmark to additional languages would test whether mE5-L's near-parity generalizes beyond Italian.
Load-bearing premise
The chosen BEIR subsets, synthetic Italian corpus, chunking strategies, and commodity-CPU latency measurements fairly represent typical multilingual dense retrieval and RAG workloads.
What would settle it
Re-running the identical models on a new multilingual corpus or different hardware that reverses the accuracy or latency ordering between GE2 and mE5-L would disprove the reported rankings.
Original abstract
We benchmark Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder with a 2,048-token context and explicit task-type conditioning, against five open-source alternatives: BGE-M3, E5-large, Multilingual-E5-large (mE5-L), LaBSE, and Paraphrase-Multilingual-MPNet (mMPNet). Evaluation covers four BEIR subsets, a synthetic Italian RAG corpus, a chunking ablation over five chunk sizes (in tokens) and three strategies, and per-query latency on commodity CPU hardware. GE2 ranks first on every task, achieving BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but at 231.6 ms median latency it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. A more striking finding concerns LaBSE, which, despite widespread multilingual deployment, scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model including mMPNet. Chunking experiments show that all six models saturate at 32-token chunks on our corpus, with semantic chunking providing measurable gains only at 16 tokens.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder, against five open-source models (BGE-M3, E5-large, mE5-L, LaBSE, mMPNet) on four BEIR subsets and a synthetic Italian RAG corpus. It reports that GE2 ranks first on all tasks with BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but has 231.6 ms median latency (roughly 14x slower than the fastest local models), and it recommends mE5-L, which runs at 31 ms, for sub-100 ms SLAs. Additional results include LaBSE scoring only 0.188 avg. nDCG@10 on BEIR and all models saturating at 32-token chunks, with semantic chunking helping only at 16 tokens.
Significance. If the empirical protocol is sound, the work would offer practical guidance on trading off hosted vs. local multilingual embeddings for dense retrieval and RAG, plus evidence that LaBSE underperforms dedicated retrievers and that chunk size saturates early. The chunking ablation could inform preprocessing choices if the corpus and strategies are representative.
Major comments (2)
- [Latency evaluation] Latency protocol (abstract and results): GE2 latency of 231.6 ms median is measured via Vertex AI API while the five baselines use local inference on commodity CPU; the protocol therefore folds in network round-trips, serialization, and queuing. This directly undermines the 14x slowdown claim and the SLA-based recommendation to prefer mE5-L under 100 ms, because the numbers do not isolate model compute cost.
- [Experimental setup and results] Experimental setup and results: the abstract states precise nDCG@10 rankings and the 0.003 gap on Italian without reporting data splits, query counts, statistical tests, or error bars. This prevents verification that GE2 truly ranks first on every task and that the mE5-L comparison is robust.
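The first major comment turns on what a latency number actually times. As a hedged illustration, the harness below measures median wall-clock time around any `encode` callable. Wrapped around a local model it captures compute only; wrapped around a hosted API client it would also capture network round-trips, serialization, and queuing, which is the confound the referee describes. `median_latency_ms` and `fake_local_encoder` are hypothetical names, not the paper's code.

```python
import statistics
import time
from typing import Callable, Iterable, List


def median_latency_ms(encode: Callable[[str], object],
                      queries: Iterable[str],
                      warmup: int = 3,
                      timer: Callable[[], float] = time.perf_counter) -> float:
    """Median per-query wall-clock latency of `encode`, in milliseconds."""
    query_list: List[str] = list(queries)
    if not query_list:
        raise ValueError("need at least one query")
    # Untimed warm-up calls so one-off setup costs do not skew the samples.
    for query in query_list[:warmup]:
        encode(query)
    samples_ms: List[float] = []
    for query in query_list:
        start = timer()
        encode(query)
        samples_ms.append((timer() - start) * 1000.0)
    return statistics.median(samples_ms)


def fake_local_encoder(text: str) -> List[float]:
    """Stand-in for a real embedding model (assumption for illustration)."""
    return [float(len(text))]


print(median_latency_ms(fake_local_encoder, ["ciao", "buongiorno", "grazie"]))
```

A fairer comparison would report both views for the hosted model where possible: end-to-end time as seen by the client, and any server-side compute time the provider exposes.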
Minor comments (1)
- [Abstract] The abstract is lengthy and contains many numeric claims; consider moving some detail to the body for improved readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating revisions where the concerns are valid.
- Referee: [Latency evaluation] Latency protocol (abstract and results): GE2 latency of 231.6 ms median is measured via Vertex AI API while the five baselines use local inference on commodity CPU; the protocol therefore folds in network round-trips, serialization, and queuing. This directly undermines the 14x slowdown claim and the SLA-based recommendation to prefer mE5-L under 100 ms, because the numbers do not isolate model compute cost.
  Authors: We agree that the latency protocol mixes end-to-end API measurements for GE2 (which include network, serialization, and queuing) with local CPU inference for the open-source models. This is a genuine limitation of the evaluation, as GE2 is only available through the hosted Vertex AI service and cannot be run locally for direct comparison. We will revise the abstract, results, and discussion sections to explicitly qualify the 231.6 ms figure and the 14x claim as end-to-end deployment latency rather than isolated model compute time. The practical recommendation for mE5-L under sub-100 ms SLAs will also be clarified to reflect deployment contexts. These changes will be made without altering the reported numbers.
  Revision: yes
- Referee: [Experimental setup and results] Experimental setup and results: the abstract states precise nDCG@10 rankings and the 0.003 gap on Italian without reporting data splits, query counts, statistical tests, or error bars. This prevents verification that GE2 truly ranks first on every task and that the mE5-L comparison is robust.
  Authors: We accept that the abstract and results lack sufficient detail on data splits, query counts, statistical tests, and error bars, which limits independent verification of the rankings and the small 0.003 gap. In the revised manuscript we will expand the experimental setup to report the exact train/test splits, query counts per benchmark, and any significance testing performed. For error bars, we will add them for any multi-run experiments and note the single-run nature of the primary results due to resource constraints. These additions will allow readers to assess the robustness of GE2 ranking first on all tasks.
  Revision: yes
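One common way to put error bars on a small gap such as the 0.003 nDCG difference is a paired bootstrap over per-query scores. The sketch below is a generic illustration of that technique, not the authors' planned analysis; `paired_bootstrap_ci` is a hypothetical helper, and the per-query scores in the usage lines are made up for demonstration and are not taken from the paper.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(scores_a: Sequence[float],
                        scores_b: Sequence[float],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Mean per-query difference (A minus B) with a percentile bootstrap CI."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equally sized, non-empty score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample queries with replacement and record the mean difference each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Made-up per-query nDCG@10 values for two systems (illustration only).
system_a = [0.31, 0.25, 0.40, 0.18, 0.29, 0.33, 0.22, 0.27]
system_b = [0.30, 0.26, 0.38, 0.19, 0.29, 0.31, 0.23, 0.26]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(system_a, system_b)
print(f"mean diff = {mean_diff:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

If the resulting interval contains zero, the data do not support ranking one system above the other, which is the check the referee is asking for.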
Circularity Check
Purely empirical benchmarking with no derivations or fitted predictions
The paper performs direct experimental comparisons of embedding models on fixed benchmarks (BEIR subsets, synthetic Italian RAG corpus) using standard metrics (nDCG@10) and measured latencies. No equations, parameter fitting, predictions derived from inputs, or self-citations are used to support central claims. All results are obtained by running the models and recording outputs; the methodology contains no self-referential steps that reduce claims to their own inputs by construction. The latency comparison, while potentially methodologically debatable, is an empirical measurement rather than a derived result.