Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation
Pith reviewed 2026-06-29 07:32 UTC · model grok-4.3
The pith
A 200M-parameter Turkish sentence embedding model built via tokenizer surgery and offline distillation outperforms its 300M-parameter teacher on Turkish semantic tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing a Turkish-optimized multilingual tokenizer with a 131,072 vocabulary through pruning and frequency-based token addition, cloning the teacher while preserving backbone weights and using mean-composition for the new embedding table, then performing offline embedding distillation with a cosine similarity objective over a balanced 40-language Wikipedia corpus, the resulting 200M-parameter student model achieves Pearson/Spearman correlations of 77.55%/77.45% on STSbTR (exceeding the teacher's 73.84%/72.92%) and a mean score of 63.9% on TR-MTEB (ranking 7th out of 26 models).
What carries the argument
Three-stage pipeline of cross-lingual tokenizer surgery via pruning and frequency analysis, mean-composition embedding table initialization, and offline cosine-similarity distillation from precomputed teacher vectors.
If this is right
- The adapted model supports an 8192-token context window, exceeding the 512-token limit of prior BERT-based Turkish encoders.
- Training finishes in roughly four hours on a single GPU at a total cost of $5-$20 by avoiding online teacher inference.
- The approach yields a competitive cost-quality trade-off with 33% fewer parameters than the teacher.
- All model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling are released for reproducibility.
Where Pith is reading between the lines
- The same tokenizer surgery and offline distillation steps could be applied to adapt the teacher to other languages using the same 40-language corpus.
- Releasing the precomputed teacher embeddings on Wikipedia enables independent experiments or extensions by researchers without access to the original teacher model.
- The method's avoidance of joint teacher-student training during distillation may allow scaling the approach to larger context windows or additional languages at modest cost.
- Testing the pipeline on non-embedding tasks such as classification or retrieval could reveal whether the transferred knowledge generalizes beyond sentence similarity.
Load-bearing premise
Mean-composition initialization of the new embedding table combined with offline cosine-similarity distillation on the 40-language Wikipedia corpus will transfer the teacher's semantic knowledge to the Turkish-optimized tokenizer without substantial degradation on Turkish-specific downstream tasks.
What would settle it
If the student model records a Pearson correlation below the teacher's 73.84% on STSbTR or ranks outside the top 10 on TR-MTEB, the effectiveness of the tokenizer surgery and offline distillation pipeline would be refuted.
Figures
read the original abstract
Sentence embeddings are a foundational component for semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072 vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens via frequency analysis on a 40-language corpus, (2) clone a teacher embedding model while preserving transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping, and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters and trains in roughly four hours on a single GPU by avoiding online teacher inference during training, at a total cost of $5-$20. Empirically, Pearson/Spearman correlations of 77.55%/77.45% are obtained on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), a mean score of 63.9% is achieved (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces embeddingmagibu-200m, a 200M-parameter Turkish sentence embedding model with 768-dim L2-normalized vectors and 8192-token context. It adapts a 300M multilingual teacher via a three-stage pipeline: (1) pruning/augmenting the tokenizer to a 131k Turkish-optimized vocabulary using frequency analysis on a 40-language corpus, (2) cloning the transformer backbone while initializing the new embedding table via mean-composition mapping, and (3) offline cosine-similarity distillation on precomputed teacher embeddings from balanced Wikipedia data. The student is claimed to achieve Pearson/Spearman correlations of 77.55%/77.45% on STSbTR (surpassing the teacher's 73.84%/72.92%) and a mean score of 63.9% on TR-MTEB (7th of 26 models), at low cost (~4 hours on one GPU) with all artifacts released.
Significance. If the empirical claims hold, the work offers a practical, low-cost method for language-specific adaptation of embedding models without full pretraining or online teacher inference during distillation. The emphasis on releasing model weights, tokenizer files, precomputed datasets, and open-source tooling strengthens reproducibility and enables downstream use for Turkish semantic tasks.
major comments (3)
- [Abstract and §4] Abstract and §4 (results): The reported STSbTR correlations (77.55%/77.45%) and TR-MTEB mean (63.9%) are presented without error bars, number of evaluation runs, statistical significance tests, or details on the evaluation protocol (e.g., splits or pooling), which is load-bearing for the central claim of outperformance over the 300M teacher.
- [§3.1-3.2] §3.1-3.2: The mean-composition initialization for the augmented 131k vocabulary (new Turkish tokens added via frequency analysis) is described as the key transfer step, but no ablation or analysis quantifies how well averaging approximates embeddings for high-frequency Turkish-specific tokens; this directly affects whether the reported gains can be attributed to the pipeline rather than initialization artifacts.
- [§3.3] §3.3: Offline distillation uses only cosine similarity on precomputed teacher vectors from the 40-language corpus with no online gradients or auxiliary losses; without evidence that this corrects potential misalignment from the mean-composition step on Turkish tokens, the transfer-without-degradation assumption remains untested and central to the superiority claim.
minor comments (2)
- [§3] The context window extension to 8192 tokens is highlighted but no ablation shows its impact on the reported Turkish tasks (most of which likely use shorter inputs).
- [§4] Table or figure captions for TR-MTEB results should explicitly list the 26 tasks and the ranking methodology to allow direct comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (results): The reported STSbTR correlations (77.55%/77.45%) and TR-MTEB mean (63.9%) are presented without error bars, number of evaluation runs, statistical significance tests, or details on the evaluation protocol (e.g., splits or pooling), which is load-bearing for the central claim of outperformance over the 300M teacher.
Authors: We agree that the lack of error bars, run counts, and protocol details weakens the presentation of the central empirical claims. In the revised manuscript we will expand the evaluation section to specify the exact STSbTR splits, pooling method, and any other protocol details. We will also report the number of runs performed and add error bars (or explicitly note single-run results with justification) along with a basic significance comparison where feasible. These changes directly address the robustness concern. revision: yes
-
Referee: [§3.1-3.2] §3.1-3.2: The mean-composition initialization for the augmented 131k vocabulary (new Turkish tokens added via frequency analysis) is described as the key transfer step, but no ablation or analysis quantifies how well averaging approximates embeddings for high-frequency Turkish-specific tokens; this directly affects whether the reported gains can be attributed to the pipeline rather than initialization artifacts.
Authors: The mean-composition mapping is a pragmatic transfer mechanism that reuses existing multilingual embeddings. While the original submission did not contain a dedicated ablation isolating its contribution, the end-to-end results (student outperforming the teacher on Turkish tasks) provide indirect support for the full pipeline. To respond to the request for quantification, we will add a short analysis or note in §3.1-3.2 examining embedding similarity for a sample of high-frequency Turkish tokens after initialization. revision: partial
-
Referee: [§3.3] §3.3: Offline distillation uses only cosine similarity on precomputed teacher vectors from the 40-language corpus with no online gradients or auxiliary losses; without evidence that this corrects potential misalignment from the mean-composition step on Turkish tokens, the transfer-without-degradation assumption remains untested and central to the superiority claim.
Authors: The offline cosine objective was selected precisely to enable low-cost adaptation without repeated teacher inference. The fact that the 200M student exceeds the 300M teacher on STSbTR supplies empirical evidence that any initialization misalignment is not performance-limiting for Turkish. In the revision we will expand §3.3 with an explicit discussion of this assumption, referencing the observed gains and the balanced multilingual training data as mitigating factors. revision: yes
Circularity Check
No circularity: empirical adaptation pipeline is self-contained
full rationale
The paper presents a three-stage engineering pipeline (tokenizer pruning/augmentation, mean-composition embedding initialization, offline cosine distillation) whose outputs are measured on external benchmarks (STSbTR, TR-MTEB). No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described method. Performance numbers are direct empirical measurements, not quantities forced by construction from the same inputs. The derivation is therefore independent of its own results.
Axiom & Free-Parameter Ledger
free parameters (2)
- new vocabulary size =
131072
- context window length =
8192
axioms (2)
- domain assumption Mean composition of original token embeddings yields a usable initialization for the new vocabulary's embedding table.
- domain assumption Cosine-similarity matching to precomputed teacher embeddings on a balanced 40-language Wikipedia corpus transfers semantic capability to the student.
Forward citations
Cited by 1 Pith paper
-
Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish
Morpheus is a morphology-aware neural tokenizer and embedder for Turkish that achieves lossless reversible segmentation, higher morphological alignment, lower bits-per-character, and competitive root-centric embedding...
Reference graph
Works this paper leans on
-
[1]
Morphscore: Evaluating morpho- logical awareness of tokenizers across languages
Catherine Arnett, Marisa Hudspeth, and Brendan O’Connor. Morphscore: Evaluating morpho- logical awareness of tokenizers across languages. https://arxiv.org/abs/2507.06378, 2025
-
[2]
M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümü¸ s, Sercan Karaka¸ s, Banu Diri, Sava¸ s Yıldırım, and Demircan Çelik. Tokens with meaning: A hybrid tokenization approach for nlp. https://arxiv.org/abs/2508.14292, 2025
-
[3]
Baysan and Tunga Güngör
Selman M. Baysan and Tunga Güngör. Tr-mteb: A comprehensive benchmark and embed- ding model suite for turkish sentence representations. https://aclanthology.org/2025. findings-emnlp.471/, 2025
2025
-
[4]
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.https://arxiv.org/abs/2402.03216, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Turkembed: Turkish embedding model on nli & sts tasks.https://arxiv.org/abs/2511.08376, 2025
Özay Ezerceli, Gizem Gümü¸ sçekiçci, Tu˘gba Erkoç, and Berke Özenç. Turkembed: Turkish embedding model on nli & sts tasks.https://arxiv.org/abs/2511.08376, 2025
-
[6]
Wechsel: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models
Benjamin Minixhofer, Fabian Paischer, and Navid Rekabsaz. Wechsel: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. https: //aclanthology.org/2022.naacl-main.293/, 2022
2022
-
[7]
MTEB: Massive Text Embedding Benchmark
Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. Mteb: Massive text embedding benchmark.https://arxiv.org/abs/2210.07316, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[8]
Taido Purason, Pavel Chizhov, Ivan P. Yamshchikov, and Mark Fishel. Teaching old tokenizers new words: Efficient tokenizer adaptation for pre-trained models. https://arxiv.org/abs/ 2512.03989, 2025
-
[9]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks.https://arxiv.org/abs/1908.10084, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1908
-
[10]
Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation.https://arxiv.org/abs/2004.09813, 2020
-
[11]
How does a language-specific tokenizer affect llms?https://arxiv.org/abs/2502.12560, 2025
Jean Seo, Jaeyoon Kim, SungJoo Byun, and Hyopil Shin. How does a language-specific tokenizer affect llms?https://arxiv.org/abs/2502.12560, 2025
-
[12]
Ebrar Kızılo˘glu, Onur Güngör, and Susan Üsküdarlı
Melik¸ sah Türker, A. Ebrar Kızılo˘glu, Onur Güngör, and Susan Üsküdarlı. Tabibert: A large- scale modernbert foundation model and unified benchmarking framework for turkish. https: //arxiv.org/abs/2512.23065, 2025
-
[13]
EmbeddingGemma: Powerful and Lightweight Text Representations
Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Multilingual E5 Text Embeddings: A Technical Report
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report.https://arxiv.org/abs/2402.05672, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zhang. mgte: Generalized long-context text representation and reranking models for multilingual text retrieval. https://arxiv.org/abs/2407.19669, 2024. A Implementation Code This appendix provides the P...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.