{"paper":{"title":"EmbeddingGemma: Powerful and Lightweight Text Representations","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A 300 million parameter model reaches state-of-the-art text embedding results on MTEB","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Aashi Jain, Abheesht Sharma, Adam Roberts, Adham Elarabawy, AJ Co, Alice Lisak, Andreas Doumanoglou, Armand Joulin, Babak Samari, Ben Hora, Biao Zhang, Brian Potetz, Cormac Brick, Dahun Kim, Daniel Cer, Daniel Salz, Divyashree Sreepathihalli, Enrique Alfonseca, Fedor Moiseev, Feiyang Chen, Feng Han, Francesco Visin, Frank Palma Gomez, Gaël Liu, Glenn Cameron, Gus Martins, Gustavo Hernández Ábrego, Henrique Schechter Vera, Hesen Zhang, Hui Hui, Ian Ballantyne, Iftekhar Naim, Jay Han, Jiageng Zhang, Jingxiao Zheng, Jinhyuk Lee, Joe Zou, Juyeong Ji, Jyotinder Singh, Kaifeng Chen, Karan Gill, Kat Black, Kathleen Kenealy, Ke Chen, Koert Chen, Lucas Gonzalez, Madhuri Shanbhogue, Mark Sherwood, Michael Boratko, Michelle Casbon, Min Choi, Mojtaba Seyedhosseini, Olivier Lacombe, Omar Sanseviero, Paul Suganthan, Qin Yin, Raphael Hoffmann, Ravin Kumar, Renjie Wu, Ryan Mullins, Sahil Dua, Sai Meher Karthik Duddu, Sandeep Mariserla, Sara Smoot, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sindhu Raghuram Panyam, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Thomas Mesnard, Tom Duerig, Trevor Walker, Tris Warkentin, Vikram Rao, Waleed Khawaja, Weiyi Wang, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Yunhsuan Sung, Zach Gleicher, Zhe Dong, Zhe Li, Zhongli Ding","submitted_at":"2025-09-24T17:56:51Z","abstract_excerpt":"We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. 
We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. No"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"EmbeddingGemma (300M) achieves state-of-the-art results on MTEB across multilingual, English, and code domains, outperforming prior top models with fewer than 500M parameters and providing performance comparable to models double its size.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the described training recipe (encoder-decoder initialization, geometric embedding distillation, spread-out regularizer, and checkpoint merging from varied mixtures) is the primary driver of the reported gains rather than data selection, base model scale, or evaluation specifics.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A 300 million parameter model reaches state-of-the-art text embedding results on 
MTEB","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8a0612c7c7bef8e097d66e3e93373adde95106a72dc497a016e798cedc3c2ce7"},"source":{"id":"2509.20354","kind":"arxiv","version":3},"verdict":{"id":"9e545389-4793-429d-899d-ba410b7b62ae","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T12:02:28.861005Z","strongest_claim":"EmbeddingGemma (300M) achieves state-of-the-art results on MTEB across multilingual, English, and code domains, outperforming prior top models with fewer than 500M parameters and providing performance comparable to models double its size.","one_line_summary":"A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the described training recipe (encoder-decoder initialization, geometric embedding distillation, spread-out regularizer, and checkpoint merging from varied mixtures) is the primary driver of the reported gains rather than data selection, base model scale, or evaluation specifics.","pith_extraction_headline":"A 300 million parameter model reaches state-of-the-art text embedding results on MTEB"},"references":{"count":27,"sample":[{"doi":"","year":2021,"title":"A. Asai, J. Kasai, J. H. Clark, K. Lee, E. Choi, and H. Hajishirzi. Xor qa: Cross-lingual open-retrieval question answering. 
In Proceedings of the 2021 Conference of the North American Chapter of the A","work_id":"c18e600f-830d-440c-9421-3954edd6e0a5","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Small Language Models are the Future of Agentic AI","work_id":"ba0f0305-4a51-48fd-a13f-201439a18f9e","ref_index":2,"cited_arxiv_id":"2506.02153","is_internal_anchor":true},{"doi":"","year":null,"title":"Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, and Even Oldridge","work_id":"4191e8bf-2d4c-4abf-aba3-948e3a5a2e46","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.48550/arxiv.2502.13595","year":null,"title":"Mmteb: Massive multilingual text embedding benchmark","work_id":"774aa5f1-35ad-4b36-b6cc-5f461cfab347","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"SimCSE: Simple Contrastive Learning of Sentence Embeddings","work_id":"e9fab1e4-f443-4963-9f2a-83f772482c00","ref_index":5,"cited_arxiv_id":"2104.08821","is_internal_anchor":true}],"resolved_work":27,"snapshot_sha256":"85b0e77d9801b104eb80bb6b55848eb97b46c8307b9500851ce23eb357c2d487","internal_anchors":11},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}