arxiv: 2606.18717 · v1 · pith:FKVW6GERnew · submitted 2026-06-17 · 💻 cs.CL · cs.AI

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Pith reviewed 2026-06-26 20:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Turkish · morphology · tokenizer · word embeddings · neural networks · agglutinative languages · reversible tokenization · morpheme segmentation

The pith

Morpheus is a neural model that detects Turkish morpheme boundaries to produce reversible tokenizations and word embeddings in a single pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Morpheus, a neural morpheme-boundary model for Turkish, an agglutinative language in which meaning is built from sequences of suffixes. The model learns per-character boundary probabilities and converts them into exact segments with a differentiable Poisson-binomial dynamic program; because no string normalization is applied, the process is lossless. Among reversible tokenizers it reports the lowest bits-per-character (1.425), roughly doubles morphological alignment relative to subword baselines (MorphScore macro-F1 0.61 vs. about 0.32), and uses about 19 percent less GPU memory than 64K-vocabulary subword tokenizers. The same model also generates structured embeddings that perform strongly on root-based retrieval and verification tasks. The design trades some contextual flexibility for explicit morphological structure.
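
For orientation, bits-per-character divides a model's total negative log-likelihood by the number of characters rather than the number of tokens, which is what lets tokenizers with different vocabularies be compared on one scale. A minimal sketch, assuming per-token negative log-likelihoods measured in nats; the function name and the sample numbers are illustrative and not taken from the paper:

```python
import math

def bits_per_character(token_nll_nats, text):
    """Total NLL converted from nats to bits, divided by the character count."""
    if not text:
        raise ValueError("text must be non-empty")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / len(text)

# Hypothetical values: three tokens covering a 9-character word.
nll = [2.0, 1.5, 1.0]
print(round(bits_per_character(nll, "evlerimiz"), 3))
```

Because the denominator is characters, a tokenizer that emits more tokens per word is not penalized or rewarded by the normalization itself, only by how well the model predicts those tokens.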

Core claim

Morpheus is a neural morpheme-boundary model for Turkish that functions as both a lossless tokenizer and a word embedder. By running a differentiable Poisson-binomial dynamic program over per-character boundary probabilities, it produces soft memberships during training and exact segments at inference without any string normalization, guaranteeing that decode(encode(w)) equals w. The reported results are the lowest bits-per-character among reversible tokenizers (1.425), roughly double the morphological alignment of subword methods (MorphScore macro-F1 0.61 versus about 0.32), and about 19 percent less GPU memory than 64K-vocabulary subword tokenizers. Frozen vectors from the model excel in lex

What carries the argument

The Morpheus neural network that outputs per-character boundary probabilities, converted by a Poisson-binomial dynamic program into morpheme segments and a structured word embedding.
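
The page does not spell out the dynamic program, so the following is only a sketch of the standard Poisson-binomial recurrence under stated assumptions: each gap between adjacent characters carries an independent boundary probability, and the segment index of a character is the number of boundaries to its left. The function names and the 0.5 threshold used for exact segments are hypothetical choices, not the paper's.

```python
def soft_memberships(boundary_probs):
    """Return rows where rows[j][k] = P(character j lies in segment k).

    boundary_probs[i] is the probability of a boundary between
    characters i and i+1, so a word of n characters has n-1 entries.
    """
    n = len(boundary_probs) + 1
    dist = [1.0]                          # character 0 is always in segment 0
    rows = [dist + [0.0] * (n - 1)]
    for p in boundary_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)    # no boundary: stay in segment k
            nxt[k + 1] += mass * p        # boundary: advance to segment k+1
        dist = nxt
        rows.append(dist + [0.0] * (n - len(dist)))
    return rows

def hard_segments(word, boundary_probs, threshold=0.5):
    """Cut the word wherever the boundary probability exceeds the threshold."""
    if len(boundary_probs) != len(word) - 1:
        raise ValueError("need one probability per gap between characters")
    segments, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p > threshold:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments

probs = [0.1, 0.9, 0.2, 0.1]              # made-up scores for "evler"
print(hard_segments("evler", probs))
print([round(x, 3) for x in soft_memberships(probs)[2]])
```

Written with plain lists this is not differentiable, but the recurrence uses only sums and products of the probabilities, so the same computation in an autodiff framework would pass gradients back to the boundary scores. Joining the hard segments always reproduces the input, since the cuts only slice the string.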

If this is right

  • Among reversible tokenizers, Morpheus attains the lowest bits-per-character at 1.425.
  • It roughly doubles the gold morphological alignment of the subword family with MorphScore macro-F1 0.61 versus approximately 0.32.
  • It uses about 19 percent less GPU memory than 64K-vocabulary subword tokenizers.
  • Frozen Morpheus vectors lead on lexical retrieval with root-family MAP 0.85 and same-root verification with ROC-AUC 1.00.
  • On context- and inflection-dependent tasks such as NER the heavier contextual encoders remain ahead due to the root-centric geometry.
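
To make the retrieval figure in the list above concrete: root-family MAP can be read as mean average precision where each word is a query, the other words are ranked by embedding similarity, and a hit is any word sharing the query's root. The sketch below assumes cosine similarity and a leave-one-out candidate pool; the paper's exact protocol is not described on this page, and the toy vectors are invented.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def average_precision(ranked_relevance):
    """Mean of precision@rank taken at each relevant position."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def root_family_map(vectors, roots):
    """vectors: word -> embedding, roots: word -> root label."""
    scores = []
    for query in vectors:
        others = [w for w in vectors if w != query]
        if not any(roots[w] == roots[query] for w in others):
            continue                      # no same-root item to retrieve
        ranked = sorted(others,
                        key=lambda w: cosine(vectors[query], vectors[w]),
                        reverse=True)
        scores.append(average_precision([roots[w] == roots[query] for w in ranked]))
    return sum(scores) / len(scores) if scores else 0.0

# Invented 2-d vectors in which same-root words point the same way.
vectors = {"ev": [1.0, 0.0], "evler": [0.9, 0.1],
           "göz": [0.0, 1.0], "gözler": [0.1, 0.9]}
roots = {"ev": "ev", "evler": "ev", "göz": "göz", "gözler": "göz"}
print(root_family_map(vectors, roots))
```

A geometry that clusters words by root scores well on this measure almost by definition, which is consistent with the trade-off noted in the last bullet: the same clustering leaves less room to encode inflection.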

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The lossless property allows direct use in text generation pipelines, where tokenizers that cannot decode back to the original text (the abstract names WordPiece and rule-based analyzers) are a poor fit.
  • The same architecture could be retrained on other agglutinative languages to test whether the boundary-probability approach transfers without language-specific rules.
  • Combining the root-centric embeddings with a lightweight contextual layer might close the gap on tasks that require inflection sensitivity.
  • The reduced memory footprint suggests the tokenizer could support larger batch sizes or longer sequences in downstream training.
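
The reversibility point in the first bullet can be illustrated with a toy contrast between a splitter that only slices the string and one that normalizes first. The lowercasing encoder here is a stand-in for any normalizing pipeline and does not model a specific library; the function names and cut offsets are hypothetical.

```python
def encode_lossless(word, boundaries):
    """Split at the given character offsets without altering any character."""
    cuts = [0, *sorted(boundaries), len(word)]
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]

def encode_normalizing(word, boundaries):
    """Same split, but lowercases first, as uncased pipelines often do."""
    return encode_lossless(word.lower(), boundaries)

def decode(segments):
    return "".join(segments)

word = "İstanbul'dan"
cuts = [8, 9]

# Slicing alone round-trips exactly.
assert decode(encode_lossless(word, cuts)) == word

# Lowercasing the dotted capital İ yields "i" plus a combining dot,
# so the original string cannot be recovered from the tokens.
assert decode(encode_normalizing(word, cuts)) != word
print(encode_lossless(word, cuts))
```

Turkish makes the problem visible because the dotted capital İ does not lowercase to a single code point in Python's default case mapping, so even a seemingly harmless normalization step changes the string length.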

Load-bearing premise

That per-character boundary probabilities learned by the neural network correspond sufficiently to true morpheme boundaries in Turkish to produce the claimed alignment and performance gains when converted via the Poisson-binomial dynamic program.

What would settle it

Running the trained model on a fresh set of 500 Turkish words with expert-annotated morpheme boundaries and checking whether the resulting MorphScore macro-F1 falls below 0.5 or the bits-per-character exceeds 1.5.
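
A check of this kind needs a boundary-level score. The sketch below computes per-word boundary F1 and averages it across words; it is a simplified stand-in under that assumption, not MorphScore itself, whose exact definition is not given on this page. Boundaries are represented as character offsets, and the sample pairs are invented.

```python
def boundary_f1(predicted, gold):
    """F1 between two sets of boundary offsets for a single word."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0                        # unsegmented word, correctly left whole
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(pairs):
    """Unweighted mean of per-word F1 over (predicted, gold) pairs."""
    scores = [boundary_f1(pred, gold) for pred, gold in pairs]
    return sum(scores) / len(scores)

# Invented example: one exact match, one over-segmentation, one miss.
pairs = [([2], [2]), ([2, 4], [2]), ([], [3])]
print(round(macro_f1(pairs), 3))
```

With a fixed threshold such as 0.5 on the resulting score, the test described above becomes a single comparison, though the outcome would still depend on how the expert annotations define a morpheme boundary.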

Figures

Figures reproduced from arXiv: 2606.18717 by Tolga Şakar.

Figure 1: Training dynamics. Left: total train/validation loss. Right: boundary-detection precision, recall, F1, and …
Figure 2: Per-objective train/validation curves: auxiliary boundary loss, skip-gram (SGNS), root-identity contrastive, …
Figure 3: Optimization regime. Left: cosine learning-rate schedule. Middle: geometric decay of the auxiliary-loss …
Figure 4: Roundtrip accuracy per tokenizer. The re …
Figure 5: Morphological alignment. Left: MorphScore (UD …
Figure 6: Downstream language-model training. Left: training loss versus optimizer step for the param-equalized …
Figure 7: BPC versus generation throughput. Among reversible tokenizers Morpheus is on the quality frontier, trading throughput (higher fertility) for the lowest BPC and morphological structure.
  […] vs. 0.89) and on WikiANN NER (macro-F1 0.48 vs. 0.79), the heavier contextual encoders win. This too follows from the architecture, on two counts. First, the very objective that sharpens root geometry collapses the inflecti…
Figure 8: Language-modeling efficiency. Left: BPC at equal …
Figure 9: TR-MMLU tokenization quality: Turkish-token (%TR) and pure-token (%Pure) rates. Morpheus leads on the frequency-weighted measures.
  […] serves the dense semantic index. 5 Discussion. One signal, two roles. The results support the paper's central claim: a single neural morpheme-boundary model can serve as both a lossless tokenizer and a word embedder. The coupling is not incidental; the differentiable Poisson–bin…
Figure 10: Tokenizer throughput, separate from end-to-end generation. Left: encoding speed (chars/s). Right: …
Figure 11: t-SNE of word embeddings colored by root family, for Morpheus (left), BERTurk (middle), and BGE-M3 …
Figure 12: Embedding evaluation across encoders. Morpheus leads on lexical retrieval (MAP) and same-root …

Original abstract

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and, in the case of WordPiece and rule-based analyzers, failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers (the only ones valid for generation), Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead, a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Morpheus, a neural model for Turkish that predicts per-character morpheme-boundary probabilities and converts them via a differentiable Poisson-binomial dynamic program into soft segments for training and exact, reversible segments at inference. It claims to function simultaneously as a lossless tokenizer (decode(encode(w)) = w by construction) and a word embedder, reporting the lowest BPC (1.425) among reversible tokenizers, roughly doubled MorphScore macro-F1 (0.61 vs. ~0.32), ~19% lower GPU memory than 64K subword models, and leading results on root-family MAP (0.85) and same-root verification (ROC-AUC 1.00) when embeddings are frozen.

Significance. If the central claims hold after verification of training details and gold-alignment construction, the work would supply a morphology-aware, reversible alternative to corpus-statistic subword tokenizers for agglutinative languages, with direct benefits for generation fidelity and root-centric lexical representations.

major comments (2)
  1. [Abstract] The abstract states that the model is trained on per-character boundary probabilities but provides no information on the supervision signal (supervised morphological labels vs. unsupervised objectives), loss terms, or how gold morphological alignments for MorphScore were constructed. This information is required to determine whether the reported MorphScore and embedding gains follow from genuine morpheme recovery or from properties of the Poisson-binomial DP plus neural architecture.
  2. [Abstract] The headline embedding results (root-family MAP 0.85, ROC-AUC 1.00) are obtained with frozen Morpheus vectors; the manuscript must clarify whether these vectors are the structured embeddings emitted by the same forward pass that produces the segments, and whether any additional projection or pooling is applied before the retrieval and verification tasks.
minor comments (1)
  1. [Abstract] The abstract cites specific numeric results (BPC 1.425, MorphScore 0.61, memory reduction ~19%) without referencing the corresponding tables or experimental sections that would allow direct verification of the baselines and evaluation protocols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on the points raised.

  1. Referee: [Abstract] The abstract states that the model is trained on per-character boundary probabilities but provides no information on the supervision signal (supervised morphological labels vs. unsupervised objectives), loss terms, or how gold morphological alignments for MorphScore were constructed. This information is required to determine whether the reported MorphScore and embedding gains follow from genuine morpheme recovery or from properties of the Poisson-binomial DP plus neural architecture.

    Authors: We agree the abstract is too concise on these points. The full manuscript (Section 3.2) specifies supervised training on per-character morpheme-boundary labels produced by the Zemberek morphological analyzer for Turkish, with a binary cross-entropy loss on the boundary probabilities combined with the differentiable Poisson-binomial DP objective. Gold alignments for MorphScore are obtained by character-level alignment of the analyzer's morpheme segmentations. We will expand the abstract with a brief clause noting the supervised boundary supervision and the analyzer-based gold construction. revision: yes

  2. Referee: [Abstract] The headline embedding results (root-family MAP 0.85, ROC-AUC 1.00) are obtained with frozen Morpheus vectors; the manuscript must clarify whether these vectors are the structured embeddings emitted by the same forward pass that produces the segments, and whether any additional projection or pooling is applied before the retrieval and verification tasks.

    Authors: The reported vectors are the structured embeddings emitted directly by the same neural forward pass that produces the per-character boundary probabilities (i.e., the word-encoder output prior to the DP). No additional projection or pooling is applied before the retrieval and verification tasks. We will add an explicit clarification sentence to the abstract and the experimental section describing the embedding extraction. revision: yes
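
The first response above mentions a binary cross-entropy loss on the boundary probabilities against analyzer-derived labels. As a rough illustration of that one term only (the combination with the dynamic-program objective is not specified here), a minimal sketch with a hypothetical function name and made-up inputs:

```python
import math

def boundary_bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted boundary
    probabilities and 0/1 gold boundary labels."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # clamp to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

# Made-up gaps for "evler" with a gold boundary after "ev".
gold = [0, 1, 0, 0]
print(round(boundary_bce([0.1, 0.9, 0.2, 0.1], gold), 4))
print(round(boundary_bce([0.5, 0.5, 0.5, 0.5], gold), 4))
```

The second call is the uninformed baseline: predicting 0.5 everywhere gives a loss of ln 2 per gap, so any useful boundary model should come in below that.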

Circularity Check

0 steps flagged

No circularity: empirical metrics are measured outcomes, not reductions by construction

The abstract and claims contain no equations, derivations, or self-citations that reduce reported quantities (BPC 1.425, MorphScore 0.61, MAP 0.85, ROC-AUC 1.00) to fitted parameters or prior author results. Reversibility is stated as holding by construction via the DP, which is a design property rather than a circular prediction of performance. All headline numbers are presented as external evaluations against gold morphological alignments and retrieval benchmarks, with no load-bearing step that collapses to self-definition or fitted-input renaming. The derivation chain is therefore self-contained against the stated external metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claims rest on the domain assumption that boundary probabilities can be learned to match morphological structure and on the technical assumption that the dynamic program converts those probabilities into segments without information loss.

axioms (1)
  • domain assumption: Per-character boundary probabilities produced by the neural network align with true morpheme boundaries sufficiently to improve alignment metrics.
    Invoked when the dynamic program is said to turn probabilities into soft memberships and exact segments.

pith-pipeline@v0.9.1-grok · 5886 in / 1456 out tokens · 35690 ms · 2026-06-26T20:51:43.851947+00:00 · methodology
