Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems

Domenico Desiato; Giandomenico Solimando; Giuseppe Polese; Stefano Cirillo

arxiv: 2605.23618 · v1 · pith:7CDTLE7Onew · submitted 2026-05-22 · 💻 cs.CL

Pith reviewed 2026-05-25 04:22 UTC · model grok-4.3

classification 💻 cs.CL

keywords Google Embeddings · multilingual retrieval · dense embeddings · RAG systems · BEIR benchmark · nDCG · chunking ablation · latency profiling

The pith

Google Embeddings 2 tops every retrieval task tested yet runs roughly fourteen times slower than the fastest local open-source models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks Google Embeddings 2 against five open-source embedding models on four BEIR subsets and a synthetic Italian RAG corpus. GE2 records the highest nDCG@10 on every task but at a median latency of 231.6 ms. Multilingual-E5-large stays within 0.003 nDCG of GE2 on Italian data while running at 31 ms. LaBSE trails every dedicated retrieval model. Chunking tests show all models plateau at 32-token segments, with semantic chunking adding value only at 16 tokens.

Core claim

GE2 ranks first on every task, achieving BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but at 231.6 ms median latency, it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. LaBSE scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model. All six models saturate at 32-token chunks, with semantic chunking providing measurable gains only at 16 tokens.
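The two latency figures in the core claim are easy to conflate, so a quick arithmetic check may help. The sketch below uses only the numbers quoted above. The "implied fastest local latency" is an inference from the rounded "roughly 14x" figure, not a value reported in the abstract.

```python
# Figures quoted in the core claim.
GE2_MEDIAN_MS = 231.6
ME5L_MEDIAN_MS = 31.0
REPORTED_SLOWDOWN = 14.0  # "roughly 14x slower than the fastest local models"


def slowdown(slow_ms: float, fast_ms: float) -> float:
    """Return how many times slower `slow_ms` is than `fast_ms`."""
    return slow_ms / fast_ms


# Ratio between GE2 and mE5-L specifically.
ge2_vs_me5l = slowdown(GE2_MEDIAN_MS, ME5L_MEDIAN_MS)

# Latency the fastest local model would need for the 14x figure to hold.
# This is an inferred value, not one stated in the abstract.
implied_fastest_ms = GE2_MEDIAN_MS / REPORTED_SLOWDOWN

print(f"GE2 vs mE5-L: {ge2_vs_me5l:.1f}x")                          # 7.5x
print(f"Implied fastest local latency: {implied_fastest_ms:.1f} ms")  # 16.5 ms
```

On these numbers the 14x factor applies to the fastest local model, while the gap to mE5-L is closer to 7.5x.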

What carries the argument

Side-by-side nDCG@10 evaluation on BEIR subsets and the Italian RAG corpus, plus CPU latency profiling and chunk-size ablations across five token lengths and three strategies.
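For readers unfamiliar with the metric, here is a minimal sketch of nDCG@10 in its linear-gain form. Whether the paper's evaluation uses exactly this variant is an assumption, and the function names and the `qrels` mapping (document id to graded relevance) are illustrative.

```python
import math
from typing import Mapping, Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: Sequence[str],
              qrels: Mapping[str, float],
              k: int = 10) -> float:
    """nDCG@k for one query: DCG of the ranking divided by the ideal DCG."""
    gains = [qrels.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical query with a single relevant document retrieved at rank 2.
score = ndcg_at_k(["d2", "d1", "d3"], {"d1": 1.0})
print(round(score, 4))
```

A benchmark-level score such as the reported BEIR average would then be a mean of per-query values, averaged again across subsets.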

If this is right

GE2 is the accuracy leader when latency constraints are absent.
mE5-L supplies near-equivalent Italian retrieval quality under strict latency limits.
LaBSE should be replaced by dedicated retrieval models such as mMPNet in new deployments.
Chunk sizes larger than 32 tokens bring no further gains on the tested corpus.
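The chunk-size finding is easier to picture with two of the three strategies written out. The sketch below covers fixed and sliding-window chunking over a pre-tokenized list. The tokenizer, the stride, and the function names are assumptions, since the abstract does not specify them. Semantic chunking is omitted because its boundary rule is not described here.

```python
from typing import List


def fixed_chunks(tokens: List[str], size: int) -> List[List[str]]:
    """Split tokens into consecutive, non-overlapping chunks of `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def sliding_chunks(tokens: List[str], size: int, stride: int) -> List[List[str]]:
    """Split tokens into overlapping windows of `size`, advancing by `stride`."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Whitespace splitting stands in for a real tokenizer (assumption).
tokens = "uno due tre quattro cinque sei sette otto nove dieci".split()
print(len(fixed_chunks(tokens, 4)))       # 3 chunks, the last one shorter
print(len(sliding_chunks(tokens, 4, 2)))  # 4 overlapping windows
```

An ablation in the style of the paper would call these with each chunk size under test and re-run retrieval on the resulting chunks.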

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

RAG pipelines with real-time requirements will likely adopt open-source models to stay under 100 ms without meaningful accuracy loss.
The latency gap suggests hosted models may need GPU acceleration or caching to compete with local inference in production.
Extending the benchmark to additional languages would test whether mE5-L's near-parity generalizes beyond Italian.

Load-bearing premise

The chosen BEIR subsets, synthetic Italian corpus, chunking strategies, and commodity-CPU latency measurements fairly represent typical multilingual dense retrieval and RAG workloads.

What would settle it

Re-running the identical models on a new multilingual corpus or different hardware that reverses the accuracy or latency ordering between GE2 and mE5-L would disprove the reported rankings.

Figures

Figures reproduced from arXiv: 2605.23618 by Domenico Desiato, Giandomenico Solimando, Giuseppe Polese, Stefano Cirillo.

**Figure 2.** nDCG@10 vs. chunk size for Fixed, Semantic, and Sliding Window strategies on IT-RAG-Bench. All models saturate at …

Original abstract

We benchmark Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder with 2,048-token context and explicit task-type conditioning, against five open-source alternatives: BGE-M3, E5-large, Multilingual-E5-large (mE5-L), LaBSE, and Paraphrase-Multilingual-MPNet (mMPNet). Evaluation covers four BEIR subsets, a synthetic Italian RAG corpus, a chunking ablation over five chunk sizes (in tokens) and three strategies, and per-query latency on commodity CPU hardware. GE2 ranks first on every task, achieving BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but at 231.6 ms median latency, it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. A more striking finding concerns LaBSE, which, despite widespread multilingual deployment, scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model including mMPNet. Chunking experiments show that all six models saturate at 32-token chunks on our corpus, with semantic chunking providing measurable gains only at 16 tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

GE2 leads the accuracy numbers on BEIR and the Italian corpus, but the latency comparison folds in Vertex AI network overhead, so the 14x slowdown claim and the SLA advice do not hold up cleanly.

The main thing to know is that this paper supplies fresh benchmark scores for Google Embeddings 2 on four BEIR subsets and a new synthetic Italian RAG corpus, plus chunking ablations, but the latency comparison mixes hosted API costs with local inference and therefore does not support the claim that mE5-L is clearly preferable under sub-100 ms constraints. GE2 comes out first on every task with BEIR avg. nDCG@10 of 0.638 and IT-RAG-Bench nDCG@10 of 0.282, while mE5-L stays within 0.003 on the Italian data at much lower reported latency. The paper also notes that LaBSE trails the other retrieval models and that all models plateau at 32-token chunks, with semantic chunking helping only at 16 tokens. Those specific numbers and the saturation finding are the concrete additions worth having for anyone selecting multilingual retrievers.

The work is a straightforward empirical comparison and does not claim new methods or theory. The chunking results are easy to follow, and the Italian corpus adds a practical data point that prior English-heavy benchmarks lack.

The soft spot is the latency protocol. The abstract reports a 231.6 ms median for GE2 next to per-query latency on commodity CPU hardware, with mE5-L at 31 ms, yet GE2 is a Vertex AI service. That timing necessarily includes network round-trips and queuing that the local forward-pass numbers avoid, so the 14x factor and the resulting SLA recommendation rest on an uneven footing. The abstract gives no error bars, variance, or statistical tests, and the corpus is synthetic, which limits how far the rankings can be generalized. Citation coverage looks standard for this style of study.

This paper is useful for practitioners who need quick model-selection data for multilingual RAG on Italian or similar languages. It is not aimed at readers seeking methodological advances or formal guarantees. I would send it for peer review once the latency measurement is clarified and basic statistical details are added; the empirical contribution is real even if the speed interpretation needs tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript benchmarks Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder, against five open-source models (BGE-M3, E5-large, mE5-L, LaBSE, mMPNet) on four BEIR subsets and a synthetic Italian RAG corpus. It reports that GE2 ranks first on all tasks with BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but has 231.6 ms median latency (roughly 14x slower than the fastest local models), and it recommends mE5-L, which runs at 31 ms, for sub-100 ms SLAs. Additional results include LaBSE scoring only 0.188 avg. nDCG@10 and all models saturating at 32-token chunks with semantic chunking helping only at 16 tokens.

Significance. If the empirical protocol is sound, the work would offer practical guidance on trading off hosted vs. local multilingual embeddings for dense retrieval and RAG, plus evidence that LaBSE underperforms dedicated retrievers and that chunk size saturates early. The chunking ablation could inform preprocessing choices if the corpus and strategies are representative.

major comments (2)

[Latency evaluation] Latency protocol (abstract and results): GE2 latency of 231.6 ms median is measured via Vertex AI API while the five baselines use local inference on commodity CPU; the protocol therefore folds in network round-trips, serialization, and queuing. This directly undermines the 14x slowdown claim and the SLA-based recommendation to prefer mE5-L under 100 ms, because the numbers do not isolate model compute cost.
[Experimental setup and results] Experimental setup and results: the abstract states precise nDCG@10 rankings and the 0.003 gap on Italian without reporting data splits, query counts, statistical tests, or error bars. This prevents verification that GE2 truly ranks first on every task and that the mE5-L comparison is robust.
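The first major comment can be made concrete with a small timing harness. The sketch below measures median per-query latency with `time.perf_counter` and shows how a fixed network delay shifts the median even when model compute is identical. `fake_local_embed` and `with_simulated_network` are hypothetical stand-ins; no real embedding model or Vertex AI call is involved, and the 10 ms delay is an arbitrary illustration.

```python
import statistics
import time
from typing import Callable, List, Sequence


def median_latency_ms(embed: Callable[[str], object],
                      queries: Sequence[str],
                      warmup: int = 3) -> float:
    """Median wall-clock time per call, in milliseconds, after a short warm-up."""
    for query in queries[:warmup]:
        embed(query)  # warm-up calls are not timed
    samples: List[float] = []
    for query in queries:
        start = time.perf_counter()
        embed(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_local_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a local model's forward pass."""
    return [float(len(text))]


def with_simulated_network(embed: Callable[[str], object],
                           delay_s: float) -> Callable[[str], object]:
    """Wrap `embed` with a fixed sleep that mimics a network round-trip."""
    def wrapped(text: str) -> object:
        time.sleep(delay_s)
        return embed(text)
    return wrapped


queries = [f"query {i}" for i in range(20)]
local_ms = median_latency_ms(fake_local_embed, queries)
hosted_ms = median_latency_ms(with_simulated_network(fake_local_embed, 0.01), queries)
print(f"local: {local_ms:.3f} ms, simulated hosted: {hosted_ms:.3f} ms")
```

Reporting the end-to-end figure and an estimate of the network share separately would let readers judge deployment latency and model compute cost independently.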

minor comments (1)

[Abstract] The abstract is lengthy and contains many numeric claims; consider moving some detail to the body for improved readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating revisions where the concerns are valid.

Referee: [Latency evaluation] Latency protocol (abstract and results): GE2 latency of 231.6 ms median is measured via Vertex AI API while the five baselines use local inference on commodity CPU; the protocol therefore folds in network round-trips, serialization, and queuing. This directly undermines the 14x slowdown claim and the SLA-based recommendation to prefer mE5-L under 100 ms, because the numbers do not isolate model compute cost.

Authors: We agree that the latency protocol mixes end-to-end API measurements for GE2 (which include network, serialization, and queuing) with local CPU inference for the open-source models. This is a genuine limitation of the evaluation, as GE2 is only available through the hosted Vertex AI service and cannot be run locally for direct comparison. We will revise the abstract, results, and discussion sections to explicitly qualify the 231.6 ms figure and the 14x claim as end-to-end deployment latency rather than isolated model compute time. The practical recommendation for mE5-L under sub-100 ms SLAs will also be clarified to reflect deployment contexts. These changes will be made without altering the reported numbers.

revision: yes

Referee: [Experimental setup and results] Experimental setup and results: the abstract states precise nDCG@10 rankings and the 0.003 gap on Italian without reporting data splits, query counts, statistical tests, or error bars. This prevents verification that GE2 truly ranks first on every task and that the mE5-L comparison is robust.

Authors: We accept that the abstract and results lack sufficient detail on data splits, query counts, statistical tests, and error bars, which limits independent verification of the rankings and the small 0.003 gap. In the revised manuscript we will expand the experimental setup to report the exact train/test splits, query counts per benchmark, and any significance testing performed. For error bars, we will add them for any multi-run experiments and note the single-run nature of the primary results due to resource constraints. These additions will allow readers to assess the robustness of GE2 ranking first on all tasks.

revision: yes
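One way to supply the missing error bars is a paired bootstrap over per-query nDCG@10 scores. The sketch below is an illustration under stated assumptions: the paper does not say which test, if any, it will use, and the per-query scores shown are made up for the example.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(scores_a: Sequence[float],
                        scores_b: Sequence[float],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Mean per-query difference (a - b) with a percentile bootstrap interval."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each (a, b) pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Made-up per-query nDCG@10 values for two models on the same queries.
model_a = [0.31, 0.27, 0.40, 0.22, 0.35, 0.29, 0.33, 0.26]
model_b = [0.30, 0.28, 0.39, 0.23, 0.34, 0.29, 0.32, 0.27]
mean_diff, lower, upper = paired_bootstrap_ci(model_a, model_b)
print(f"mean diff {mean_diff:.4f}, 95% interval [{lower:.4f}, {upper:.4f}]")
```

If the interval for a gap as small as 0.003 straddles zero, the ranking between the two models would not be distinguishable on that query set.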

Circularity Check

0 steps flagged

Purely empirical benchmarking with no derivations or fitted predictions

The paper performs direct experimental comparisons of embedding models on fixed benchmarks (BEIR subsets, synthetic Italian RAG corpus) using standard metrics (nDCG@10) and measured latencies. No equations, parameter fitting, predictions derived from inputs, or self-citations are used to support central claims. All results are obtained by running the models and recording outputs; the methodology contains no self-referential steps that reduce claims to their own inputs by construction. The latency comparison, while potentially methodologically debatable, is an empirical measurement rather than a derived result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, fitted parameters, or new postulated entities. All claims rest on experimental measurements of existing models.

