arxiv: 2605.13521 · v1 · submitted 2026-05-13 · 💻 cs.IR

Recognition: no theorem link

Granite Embedding Multilingual R2 Models

Parul Awasthy , Aashka Trivedi , Yushu Yang , Ken Barker , Yulong Li , Bhavani Iyer , Martin Franz , Meet Doshi

show 10 more authors

Riyaz Bhat Vignesh P Vishwajeet Kumar Todd Ward Abraham Daniels Rudra Murthy Madison Lee Luis Lastras Jaydeep Sen Radu Florian

Authors on Pith no claims yet

Pith reviewed 2026-05-14 18:12 UTC · model grok-4.3

classification 💻 cs.IR

keywords multilingual embeddingsdense retrievalcode retrievallong document retrievalModernBERTMatryoshka representationsembedding modelsenterprise search

0 comments

The pith

Granite R2 multilingual embedding models achieve state-of-the-art retrieval across more than 200 languages and code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a family of encoder-based embedding models for dense retrieval that extend prior English work to support 52 languages plus programming code. These bi-encoder models use the ModernBERT architecture with an expanded vocabulary and increase the context window to 32,768 tokens. A full 311-million-parameter version and a pruned 97-million-parameter compact version both report leading results on multilingual text search, code retrieval, long-document search, and reasoning retrieval benchmarks. The release targets enterprise use with governed training data and an open Apache 2.0 license.

Core claim

The authors establish that scaling context length to 32,768 tokens while expanding vocabulary support for 52 languages and code, within bi-encoder models built on ModernBERT and trained on enterprise-appropriate data, produces state-of-the-art performance on multilingual and cross-lingual retrieval tasks, with the larger model additionally enabling Matryoshka representation learning for variable embedding sizes.

What carries the argument

The multilingual Granite Embedding R2 models as bi-encoders based on ModernBERT with expanded multilingual vocabulary and Matryoshka support in the full-size variant.

If this is right

A single model family can handle dense retrieval for text, code, long documents, and reasoning tasks in over 200 languages.
The 97M-parameter pruned model delivers the highest retrieval score among any open multilingual embedding model under 100M parameters.
Matryoshka representation learning in the full-size model permits trading embedding dimensionality for speed or storage without retraining.
Open release under Apache 2.0 with governance oversight enables direct integration into enterprise search systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Organizations could replace multiple language-specific retrievers with one multilingual model, lowering maintenance overhead.
The pruning and vocabulary-selection approach used for the compact model offers a template for shrinking other embedding architectures while preserving accuracy.
The 32k-token context may improve retrieval quality on lengthy technical or legal documents that exceed typical shorter windows.

Load-bearing premise

The reported benchmark results accurately measure generalization to real enterprise data rather than reflecting tuning or overfitting to the specific evaluation sets.

What would settle it

Retrieval accuracy measured on a fresh multilingual enterprise dataset collected independently of the training data and prior benchmarks would need to fall below current leading scores to falsify the state-of-the-art claim.

Figures

Figures reproduced from arXiv: 2605.13521 by Aashka Trivedi, Abraham Daniels, Bhavani Iyer, Jaydeep Sen, Ken Barker, Luis Lastras, Madison Lee, Martin Franz, Meet Doshi, Parul Awasthy, Radu Florian, Riyaz Bhat, Rudra Murthy, Todd Ward, Vignesh P, Vishwajeet Kumar, Yulong Li, Yushu Yang.

**Figure 2.** Figure 2: Speed vs. accuracy Pareto frontier for multilingual embedding models. The x-axis shows [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

read the original abstract

We introduce the multilingual Granite Embedding R2 models, a family of encoder-based embedding models for enterprise-scale dense retrieval across 200+ languages. Extending our English-focused R2 release, these models add enhanced support for 52 languages and programming code, a 32,768-token context window (a 64x expansion over R1), and state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets. The release consists of two bi-encoder models based on the ModernBERT architecture with an expanded multilingual vocabulary: a 311M-parameter full-size, and a 97M-parameter compact model built via model pruning and vocabulary selection that achieves the highest retrieval score of any open multilingual embedding model under 100M parameters. The full-size also supports Matryoshka Representation Learning for flexible embedding dimensionality. Both models are trained on enterprise-appropriate data with governance oversight, and released under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite, designed to support responsible use and enable unrestricted research and enterprise adoption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Granite Multilingual R2 is a practical open model release with claimed SOTA performance that still needs full benchmark details to verify.

read the letter

The main thing to know is that this paper ships two open embedding models under Apache 2.0: a 311M-parameter ModernBERT-based encoder and a pruned 97M-parameter compact version, both extended from the prior English R2 work to handle 52 languages plus code with a 32k context window and Matryoshka support for flexible dimensions. The compact model is positioned as the strongest open option under 100M parameters on retrieval tasks. That is the concrete contribution here—an engineering release aimed at enterprise multilingual and code search use cases rather than a new theoretical approach. The models are hosted on Hugging Face, which lowers the barrier for people who need ready-to-use retrievers with governance notes on the training data. The pruning and vocabulary expansion steps are standard but executed at a scale that could be useful for practitioners who want smaller footprints without retraining from scratch. The soft spot is the performance evidence. The abstract asserts state-of-the-art results across multilingual text, cross-lingual, code, long-document, and reasoning retrieval, yet the available text gives no training corpus breakdown, no per-dataset scores with baselines, no error bars, and no evaluation protocol specifics. Without those, it is difficult to confirm whether the numbers reflect genuine gains or conditions specific to the test sets. If the full manuscript includes the tables and ablations, that would address the gap; otherwise the paper rests more on the released artifacts than on independently verifiable superiority. This work is aimed at engineers building retrieval systems for global or code-heavy applications who can download and test the models directly. Readers focused on practical deployment will find value in the open weights and context length increase. It deserves a serious referee because the models are released and the claims are testable once the missing details are supplied, even though the underlying architecture follows existing ModernBERT patterns.

Referee Report

2 major / 0 minor

Summary. The paper introduces the multilingual Granite Embedding R2 models, a family of encoder-based bi-encoder embedding models for enterprise-scale dense retrieval across 200+ languages. Extending prior English R1 work, it presents two ModernBERT-based variants (311M-parameter full model and 97M-parameter pruned compact model) with expanded multilingual vocabulary, 32,768-token context, support for 52 languages plus code, and Matryoshka Representation Learning on the larger model. The central claim is state-of-the-art overall performance on multilingual/cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets, with both models trained on enterprise-appropriate data and released under Apache 2.0.

Significance. If the empirical claims can be verified with full training details and benchmark tables, the work would deliver practically useful open models for multilingual enterprise retrieval, especially the compact 97M variant that reportedly achieves the highest score among open models under 100M parameters. The 64x context expansion and code support represent meaningful extensions, and the governance-focused training data is a positive for responsible deployment. However, the current manuscript text supplies no numbers, baselines, or protocols, so the significance cannot yet be assessed.

major comments (2)

[Abstract] Abstract: the central claim of 'state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets' is asserted without any reported scores, baseline comparisons, error bars, dataset names, or evaluation protocols (e.g., pooling method, normalization, or Matryoshka dimensionality choices). This absence makes the performance claim unverifiable from the manuscript.
[Abstract] Abstract: no description is given of the training corpus, the 52-language + code mixture, exact benchmark tables with per-dataset results, or any ablation studies. These details are load-bearing for the generalization claim that the models outperform prior open multilingual embeddings rather than matching them under undisclosed conditions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and for highlighting the need for greater specificity in the abstract. We agree that the abstract should be self-contained and will revise it to include key numerical results, dataset references, and protocol notes drawn from the full manuscript's Experiments and Training sections. This will make the SOTA claims verifiable without requiring readers to consult later sections. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of 'state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets' is asserted without any reported scores, baseline comparisons, error bars, dataset names, or evaluation protocols (e.g., pooling method, normalization, or Matryoshka dimensionality choices). This absence makes the performance claim unverifiable from the manuscript.

Authors: We agree the abstract must include concrete evidence to support the central claim. The full manuscript reports these details in Section 5 (Experiments), including per-dataset scores on MTEB multilingual subsets, CodeSearchNet, LongBench, and reasoning tasks, with comparisons to baselines such as mBERT, XLM-R, and other open multilingual models. Evaluation follows standard dense retrieval protocols (e.g., cosine similarity, top-k recall, no pooling beyond mean pooling on embeddings). We will revise the abstract to cite the highest-scoring datasets and note the Matryoshka dimensions used (e.g., 768 and 256). Error bars are omitted as single-run results are standard in this literature, but we can add a statement on reproducibility if required. revision: yes
Referee: [Abstract] Abstract: no description is given of the training corpus, the 52-language + code mixture, exact benchmark tables with per-dataset results, or any ablation studies. These details are load-bearing for the generalization claim that the models outperform prior open multilingual embeddings rather than matching them under undisclosed conditions.

Authors: The manuscript does describe the training corpus (enterprise-curated mix covering 52 languages plus code) and the 52-language + code mixture in Section 3 (Data), with exact token counts and language distribution. Benchmark tables with per-dataset results appear in Tables 2–4, and ablations on vocabulary expansion, context length (32k), and pruning appear in Section 4.3. To address the concern directly in the abstract, we will add a concise clause referencing the data scale and directing readers to the tables for per-dataset breakdowns. This strengthens the generalization claim without altering the underlying results. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical claims rest on external benchmarks without reducing to self-fitted inputs

full rationale

The paper presents no mathematical derivations, equations, or parameter-fitting steps that could reduce to prior results by construction. Performance claims are framed as empirical outcomes on multilingual, code, and reasoning retrieval datasets. The single self-reference to an 'English-focused R2 release' is descriptive background and does not carry any load-bearing uniqueness theorem, ansatz, or prediction that is redefined as output. No self-definitional loops, fitted-input predictions, or imported uniqueness results appear. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the work relies on standard embedding model training assumptions and ModernBERT architecture from prior literature.

pith-pipeline@v0.9.0 · 5553 in / 1032 out tokens · 86745 ms · 2026-05-14T18:12:51.598107+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 23 canonical work pages · 8 internal anchors

[1]

Phi-4 Technical Report

URLhttps://arxiv.org/abs/2412.08905. Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael G¨unther, Max- imilian Werk, and Han Xiao. jina-embeddings-v5-text: Task-targeted embedding distillation,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

jina-embeddings-v5-text: Task-Targeted Embedding Distillation

URLhttps://arxiv.org/abs/2602.15547. Dimo Angelov. Top2vec: Distributed representations of topics.CoRR, abs/2008.09470,

work page internal anchor Pith review Pith/arXiv arXiv 2008
[3]

URL https://arxiv.org/abs/2008.09470. Parul Awasthy, Aashka Trivedi, Yulong Li, Mihaela Bornea, David Cox, Abraham Daniels, Martin Franz, Gabe Goodhart, Bhavani Iyer, Vishwajeet Kumar, Luis Lastras, Scott McCarley, Rudra Murthy, Vig- nesh P, Sara Rosenthal, Salim Roukos, Jaydeep Sen, Sukriti Sharma, Avirup Sil, Kate Soule, Arafat Sultan, and Radu Florian....

work page arXiv 2008
[4]

URL https://arxiv.org/abs/ 2502.13595

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, M ´arton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemi ´nski, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Math- ieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, ¨Omer C ¸ a˘gatan, Akash Kundu, Martin Bernsto...

work page arXiv
[5]

URL https://arxiv.org/abs/ 2502.13595

doi: 10.48550/arXiv.2502.13595. URLhttps: //arxiv.org/abs/2502.13595. Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence em- beddings. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–...

work page doi:10.48550/arxiv.2502.13595 2021
[6]

SimCSE: Simple Contrastive Learning of Sentence Embeddings , booktitle =

Association for Com- putational Linguistics. doi: 10.18653/v1/2021.emnlp-main.552. URLhttps://aclanthology. org/2021.emnlp-main.552. Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. Arcee’s MergeKit: A toolkit for merging large lan- guage models. In Franck Dernoncou...

work page doi:10.18653/v1/2021.emnlp-main.552 2021
[7]

doi: 10.18653/v1/2024.emnlp-industry.36

Association for Computational Linguis- tics. doi: 10.18653/v1/2024.emnlp-industry.36. URLhttps://aclanthology.org/2024. emnlp-industry.36. Hajar Emami Gohari, Swanand Ravindra Kadhe, Syed Yousaf Shah, Constantin Adam, Abdulhamid Adebayo, Praneet Adusumilli, Farhan Ahmed, Nathalie Baracaldo Angel, Santosh Subhashrao Borse, Yuan-Chi Chang, Xuan-Hong Dang, N...

work page doi:10.18653/v1/2024.emnlp-industry.36 2024
[8]

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping

URLhttps://arxiv.org/abs/2205.13147. Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models,

work page arXiv
[9]

12 IBM Granite Embeddings Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang

URLhttps://arxiv.org/abs/2407.02883. 12 IBM Granite Embeddings Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning,

work page arXiv
[10]

Towards General Text Embeddings with Multi-stage Contrastive Learning

URLhttps://arxiv.org/ abs/2308.03281. Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sad´e, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Am´elie H´eliou, Amos You, Andy Ehrenberg, Andy Lo, Anton Eliseev, Antonia Calvi, Avinash Soori- yarachchi, Baptiste Bout, ...

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Ministral 3

URL https://arxiv.org/abs/2601.08584. Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. Fineweb-edu: the finest collection of educational content,

work page internal anchor Pith review arXiv
[12]

mmbert: A modern multilingual encoder with annealed language learning,

URLhttps:// arxiv.org/abs/2509.06888. Luke Merrick, Danmei Xu, Gaurav Nuti, and Daniel Campos. Arctic-embed: Scalable, efficient, and accurate text embedding models,

work page arXiv
[13]

Text and Code Embeddings by Contrastive Pre-Training

URLhttps: //arxiv.org/abs/2201.10005. Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic embed: Training a reproducible long context text embedder,

work page internal anchor Pith review arXiv
[14]

arXiv preprint arXiv:2506.20920 , year=

URLhttps: //arxiv.org/abs/2506.20920. 13 IBM Granite Embeddings Yousaf Shah. Granite 4.1 llms: How they’re built,

work page arXiv
[15]

Reasonir: Training retrievers for reasoning tasks.arXiv preprint arXiv:2504.20595,

URLhttps://arxiv.org/abs/2504.20595. Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune bert for text classification? In China national conference on Chinese computational linguistics, pp. 194–206. Springer,

work page arXiv
[16]

URL https://arxiv.org/abs/2503.19786. Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Pa- nyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lu- cas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk L...

work page internal anchor Pith review Pith/arXiv arXiv
[17]

EmbeddingGemma: Powerful and Lightweight Text Representations

URL https://arxiv.org/abs/2509.20354. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training.arXiv preprint arXiv:2212.03533,

work page internal anchor Pith review arXiv
[18]

Multilingual E5 Text Embeddings: A Technical Report

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report.arXiv preprint arXiv:2402.05672, 2024a. Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report, 2024b. Benjamin Warner, Antoine Chaffin, Benjami...

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Rar-b: Reasoning as retrieval benchmark

Chenghao Xiao, G Thomas Hudson, and Noura Al Moubayed. Rar-b: Reasoning as retrieval benchmark. arXiv preprint arXiv:2404.06347,

work page arXiv
[20]

N., Ahmed, J., and Overwijk, A

URLhttps://arxiv.org/abs/2007.00808. Puxuan Yu, Luke Merrick, Gaurav Nuti, and Daniel Campos. Arctic-embed 2.0: Multilingual retrieval without compromise,

work page arXiv 2007
[21]

ISBN 9781450360142

Association for Computing Machinery. ISBN 9781450360142. doi: 10.1145/3269206.3271800. URLhttps://doi.org/10.1145/3269206.3271800. Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zhang. mgte: Generalized long-context text representation and reranking...

work page doi:10.1145/3269206.3271800
[22]

Zhang, Y

URL https://arxiv.org/abs/2407.19669. Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advanc- ing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176,

work page arXiv
[23]

Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li

URLhttps://arxiv.org/abs/2009.13013. Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. Longembed: Extending embedding models for long context retrieval,

work page arXiv 2009
[24]

Parul Awasthy was the challenge lead on the project overall, calling from WRL, with Jaydeep Sen coordinating the work from IRL

15 IBM Granite Embeddings A CONTRIBUTIONS The Granite R2 embedding models were truly the outcome of a successful collaboration across geogra- phies led by Radu Florian - with contributions from IBM Watson Research Lab (WRL) lab and India Research Lab (IRL). Parul Awasthy was the challenge lead on the project overall, calling from WRL, with Jaydeep Sen coo...

2024
[25]

C.5 RUNTIMESPEED FORMODERNBERTMODELS In late April 2026, we observed that our ModernBERT-based embedding models had roughly halved in throughput, with no corresponding change in our own training or inference code. Initial debugging explored several plausible hypotheses — attention backend fallback ordering (whether the model was silently dropping from Fla...

2026