arxiv: 2605.23618 · v1 · pith:7CDTLE7Onew · submitted 2026-05-22 · 💻 cs.CL

Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems

Pith reviewed 2026-05-25 04:22 UTC · model grok-4.3

classification 💻 cs.CL
keywords Google Embeddings · multilingual retrieval · dense embeddings · RAG systems · BEIR benchmark · nDCG · chunking ablation · latency profiling

The pith

Google Embeddings 2 tops multilingual retrieval scores yet runs roughly fourteen times slower than the fastest open-source rivals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks Google Embeddings 2 against five open-source embedding models on four BEIR subsets and a synthetic Italian RAG corpus. GE2 records the highest nDCG@10 on every task but at a median latency of 231.6 ms. Multilingual-E5-large stays within 0.003 nDCG of GE2 on Italian data while running at 31 ms. LaBSE trails every dedicated retrieval model. Chunking tests show all models plateau at 32-token segments, with semantic chunking adding value only at 16 tokens.

Core claim

GE2 ranks first on every task, achieving BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but at 231.6 ms median latency it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. LaBSE scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model. All six models saturate at 32-token chunks, with semantic chunking providing measurable gains only at 16 tokens.
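The latency figures in the core claim can be cross-checked with simple arithmetic. The sketch below uses only the three numbers quoted above (231.6 ms, 31 ms, and the "roughly 14x" factor); the variable names are illustrative, and the implied latency of the fastest local model is a derived quantity, not a value reported here.

```python
# Numbers quoted in the core claim; nothing else is assumed.
GE2_MEDIAN_MS = 231.6    # GE2 median per-query latency
ME5L_MEDIAN_MS = 31.0    # mE5-L per-query latency
REPORTED_SLOWDOWN = 14   # "roughly 14x" vs. the fastest local models

# Latency the fastest local model would need for the 14x figure to hold.
implied_fastest_ms = GE2_MEDIAN_MS / REPORTED_SLOWDOWN

# Slowdown of GE2 relative to mE5-L specifically.
ge2_vs_me5l = GE2_MEDIAN_MS / ME5L_MEDIAN_MS

print(f"implied fastest local latency: {implied_fastest_ms:.1f} ms")  # 16.5 ms
print(f"GE2 vs mE5-L: {ge2_vs_me5l:.1f}x")                            # 7.5x
```

Read this way, the 14x factor is measured against the fastest local model, while the gap to mE5-L itself is closer to 7.5x.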

What carries the argument

Side-by-side nDCG@10 evaluation on BEIR subsets and the Italian RAG corpus, plus CPU latency profiling and chunk-size ablations across five token lengths and three strategies.
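For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@10 for a single query. It assumes the linear-gain formulation with a log2 rank discount and unique document IDs in the ranking; the evaluation tooling actually used in the paper is not stated here, and the function and argument names are illustrative.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (rank 1 -> log2(2))."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: List[str],
              relevance: Dict[str, float],
              k: int = 10) -> float:
    """nDCG@k for one query; documents missing from `relevance` count as 0."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

A benchmark-level score such as the BEIR average is then typically the mean of this value over queries, followed by a mean over datasets.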

If this is right

  • GE2 is the accuracy leader when latency constraints are absent.
  • mE5-L supplies near-equivalent Italian retrieval quality under strict latency limits.
  • LaBSE should be replaced by dedicated retrieval models such as mMPNet in new deployments.
  • Chunk sizes larger than 32 tokens bring no further gains on the tested corpus.
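To make the chunk-size findings concrete, here is a minimal sketch of two of the three chunking strategies named in the paper (fixed and sliding window). It assumes the input is already tokenized; the paper's tokenizer, stride, and semantic-chunking procedure are not given here, so the `stride` parameter and both function names are hypothetical.

```python
from typing import List


def fixed_chunks(tokens: List[str], size: int) -> List[List[str]]:
    """Split tokens into consecutive, non-overlapping chunks of `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def sliding_window_chunks(tokens: List[str], size: int, stride: int) -> List[List[str]]:
    """Overlapping chunks of `size` tokens, advancing `stride` tokens each step."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last window already reaches the end
            break
    return chunks
```

For example, ten tokens with `size=4` yield three fixed chunks (the last one short), and four sliding-window chunks with `stride=2`.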

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • RAG pipelines with real-time requirements will likely adopt open-source models to stay under 100 ms without meaningful accuracy loss.
  • The latency gap suggests hosted models may need GPU acceleration or caching to compete with local inference in production.
  • Extending the benchmark to additional languages would test whether mE5-L's near-parity generalizes beyond Italian.

Load-bearing premise

The chosen BEIR subsets, synthetic Italian corpus, chunking strategies, and commodity-CPU latency measurements fairly represent typical multilingual dense retrieval and RAG workloads.

What would settle it

If re-running the same models on a new multilingual corpus or on different hardware reversed the accuracy or latency ordering between GE2 and mE5-L, the reported rankings would not hold.

Figures

Figures reproduced from arXiv: 2605.23618 by Domenico Desiato, Giandomenico Solimando, Giuseppe Polese, Stefano Cirillo.

Figure 1: Per-query latency (ms) vs. nDCG@10 (BEIR avg.). Dashed line:
Figure 2: nDCG@10 vs. chunk size for Fixed, Semantic, and Sliding Window strategies on IT-RAG-Bench. All models saturate at

Original abstract

We benchmark Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder with a 2,048-token context and explicit task-type conditioning, against five open-source alternatives: BGE-M3, E5-large, Multilingual-E5-large (mE5-L), LaBSE, and Paraphrase-Multilingual-MPNet (mMPNet). Evaluation covers four BEIR subsets, a synthetic Italian RAG corpus, a chunking ablation over five chunk sizes (in tokens) and three strategies, and per-query latency on commodity CPU hardware. GE2 ranks first on every task, achieving BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but at 231.6 ms median latency it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. A more striking finding concerns LaBSE, which, despite widespread multilingual deployment, scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model including mMPNet. Chunking experiments show that all six models saturate at 32-token chunks on our corpus, with semantic chunking providing measurable gains only at 16 tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript benchmarks Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder, against five open-source models (BGE-M3, E5-large, mE5-L, LaBSE, mMPNet) on four BEIR subsets and a synthetic Italian RAG corpus. It reports that GE2 ranks first on all tasks with BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but has 231.6 ms median latency (roughly 14x slower than the fastest local models), and it recommends mE5-L, which runs at 31 ms, for sub-100 ms SLAs. Additional results include LaBSE scoring only 0.188 avg. nDCG@10 and all models saturating at 32-token chunks, with semantic chunking helping only at 16 tokens.

Significance. If the empirical protocol is sound, the work would offer practical guidance on trading off hosted vs. local multilingual embeddings for dense retrieval and RAG, plus evidence that LaBSE underperforms dedicated retrievers and that chunk size saturates early. The chunking ablation could inform preprocessing choices if the corpus and strategies are representative.

major comments (2)
  1. [Latency evaluation] Latency protocol (abstract and results): GE2 latency of 231.6 ms median is measured via Vertex AI API while the five baselines use local inference on commodity CPU; the protocol therefore folds in network round-trips, serialization, and queuing. This directly undermines the 14x slowdown claim and the SLA-based recommendation to prefer mE5-L under 100 ms, because the numbers do not isolate model compute cost.
  2. [Experimental setup and results] Experimental setup and results: the abstract states precise nDCG@10 rankings and the 0.003 gap on Italian without reporting data splits, query counts, statistical tests, or error bars. This prevents verification that GE2 truly ranks first on every task and that the mE5-L comparison is robust.
minor comments (1)
  1. [Abstract] The abstract is lengthy and contains many numeric claims; consider moving some detail to the body for improved readability.
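The first major comment turns on what a per-query latency measurement actually includes. The sketch below is one way to illustrate it, not the paper's protocol: `median_latency_ms` times any encoder callable end to end, and the hypothetical `with_fixed_overhead` wrapper adds a constant delay to stand in for network round-trips, serialization, and queuing. All names, the warm-up count, and the toy encoder are assumptions for illustration.

```python
import statistics
import time
from typing import Callable, List, Sequence


def median_latency_ms(encode: Callable[[str], object],
                      queries: Sequence[str],
                      warmup: int = 3) -> float:
    """Median wall-clock time per call to `encode`, in milliseconds."""
    if not queries:
        raise ValueError("queries must be non-empty")
    for q in queries[:warmup]:      # warm-up calls are not timed
        encode(q)
    samples: List[float] = []
    for q in queries:
        start = time.perf_counter()
        encode(q)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def with_fixed_overhead(encode: Callable[[str], object],
                        overhead_s: float) -> Callable[[str], object]:
    """Wrap an encoder with a constant delay (a stand-in for API overhead)."""
    def wrapped(text: str) -> object:
        time.sleep(overhead_s)
        return encode(text)
    return wrapped


def toy_encode(text: str) -> List[float]:
    """Placeholder encoder; a real model call would go here."""
    return [float(len(text))]
```

Timing `toy_encode` directly and through `with_fixed_overhead(toy_encode, 0.005)` yields different medians for identical model compute, which is the confound the referee describes.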

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating revisions where the concerns are valid.

  1. Referee: [Latency evaluation] Latency protocol (abstract and results): GE2 latency of 231.6 ms median is measured via Vertex AI API while the five baselines use local inference on commodity CPU; the protocol therefore folds in network round-trips, serialization, and queuing. This directly undermines the 14x slowdown claim and the SLA-based recommendation to prefer mE5-L under 100 ms, because the numbers do not isolate model compute cost.

    Authors: We agree that the latency protocol mixes end-to-end API measurements for GE2 (which include network, serialization, and queuing) with local CPU inference for the open-source models. This is a genuine limitation of the evaluation, as GE2 is only available through the hosted Vertex AI service and cannot be run locally for direct comparison. We will revise the abstract, results, and discussion sections to explicitly qualify the 231.6 ms figure and the 14x claim as end-to-end deployment latency rather than isolated model compute time. The practical recommendation for mE5-L under sub-100 ms SLAs will also be clarified to reflect deployment contexts. These changes will be made without altering the reported numbers.

    revision: yes

  2. Referee: [Experimental setup and results] Experimental setup and results: the abstract states precise nDCG@10 rankings and the 0.003 gap on Italian without reporting data splits, query counts, statistical tests, or error bars. This prevents verification that GE2 truly ranks first on every task and that the mE5-L comparison is robust.

    Authors: We accept that the abstract and results lack sufficient detail on data splits, query counts, statistical tests, and error bars, which limits independent verification of the rankings and the small 0.003 gap. In the revised manuscript we will expand the experimental setup to report the exact train/test splits, query counts per benchmark, and any significance testing performed. For error bars, we will add them for any multi-run experiments and note the single-run nature of the primary results due to resource constraints. These additions will allow readers to assess the robustness of GE2 ranking first on all tasks.

    revision: yes
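One common way to attach an error bar to a small gap such as 0.003 nDCG is a paired bootstrap over per-query scores. The sketch below is illustrative only: per-query scores are not published in this summary, so the inputs are hypothetical, and the function name, resample count, and percentile method are assumptions rather than the authors' planned analysis.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(scores_a: Sequence[float],
                        scores_b: Sequence[float],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile CI for mean(scores_a - scores_b), resampling queries jointly."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    # Pairing: each entry is the same query scored under models A and B.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

If the resulting interval for the GE2 minus mE5-L difference on the Italian corpus contains zero, the 0.003 gap would not support a strict ranking between the two models.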

Circularity Check

0 steps flagged

Purely empirical benchmarking with no derivations or fitted predictions

The paper performs direct experimental comparisons of embedding models on fixed benchmarks (BEIR subsets, synthetic Italian RAG corpus) using standard metrics (nDCG@10) and measured latencies. No equations, parameter fitting, predictions derived from inputs, or self-citations are used to support central claims. All results are obtained by running the models and recording outputs; the methodology contains no self-referential steps that reduce claims to their own inputs by construction. The latency comparison, while potentially methodologically debatable, is an empirical measurement rather than a derived result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, fitted parameters, or new postulated entities. All claims rest on experimental measurements of existing models.

