pith. machine review for the scientific record.

arxiv: 2604.10665 · v1 · submitted 2026-04-12 · 💻 cs.CL · cs.IR

Recognition: unknown

HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:30 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords Turkish · syllable tokenization · retrieval · BERT · OOV-free vocabulary · phonological patterns · TQuAD · chunk-based retrieval

The pith

A syllable tokenizer built on Turkish's fixed phonological patterns lets a 1.5M-parameter model beat a 200-times-larger morphology baseline on retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

HeceTokenizer exploits the six regular phonological patterns of Turkish to define a closed vocabulary of roughly 8,000 syllable types that covers every possible word without out-of-vocabulary tokens. The authors train a BERT-tiny encoder from scratch on Turkish Wikipedia using masked language modeling and pair it with a fine-grained chunking strategy for retrieval. On the TQuAD benchmark this setup reaches 50.3 percent Recall@5, exceeding the 46.92 percent achieved by a morphology-driven baseline whose model is two hundred times larger. The work demonstrates that language-specific phonological regularity can serve as a lightweight inductive bias that reduces both vocabulary size and model scale for Turkish retrieval tasks.
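The paper's own segmentation rules are not reproduced in this review, but Turkish syllabification is close to deterministic: every vowel anchors a nucleus, and a single consonant before a vowel attaches to the syllable that follows it. A minimal right-to-left sketch, with the standard six templates (V, VC, VCC, CV, CVC, CVCC) assumed rather than taken from the paper:

```python
# Minimal Turkish syllabifier sketch. Assumptions: the six templates
# are V, VC, VCC, CV, CVC, CVCC; edge cases a production tokenizer
# must handle (loanwords, apostrophes, digits) are ignored here.
VOWELS = set("aeıioöuü")

def syllabify(word):
    """Right-to-left greedy split: each vowel is a nucleus; exactly one
    consonant (if present) joins the following syllable as its onset;
    any leftover consonants trail the preceding syllable as a coda."""
    word = word.lower()
    syllables = []
    i = len(word)
    while i > 0:
        j = i - 1
        while j >= 0 and word[j] not in VOWELS:
            j -= 1                       # walk back to the nucleus vowel
        if j < 0:                        # no vowel left (e.g. an acronym)
            if syllables:
                syllables[0] = word[:i] + syllables[0]
            else:
                syllables.append(word[:i])
            return syllables
        # at most one consonant forms the onset of the current syllable
        start = j - 1 if j - 1 >= 0 and word[j - 1] not in VOWELS else j
        syllables.insert(0, word[start:i])
        i = start
    return syllables
```

On regular words this yields segments matching the six templates: `syllabify("kitap")` gives `["ki", "tap"]`, and `syllabify("türkçe")` gives `["türk", "çe"]`, whose first syllable is a CVCC form.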

Core claim

HeceTokenizer constructs an out-of-vocabulary-free vocabulary of approximately 8,000 syllable types from Turkish's deterministic six-pattern phonological structure, trains a 1.5-million-parameter BERT-tiny model on masked language modeling, and, when combined with chunk-based retrieval, achieves 50.3 percent Recall@5 on TQuAD, surpassing the 46.92 percent of a morphology-driven baseline that uses a model two hundred times larger.

What carries the argument

HeceTokenizer, which maps any Turkish text to tokens drawn from a closed set of syllables generated by six fixed phonological patterns.

If this is right

  • A closed syllable vocabulary removes the need for special handling of unknown tokens throughout Turkish retrieval pipelines.
  • Phonological regularity can substitute for morphological analysis rules and for very large parameter counts in retrieval settings.
  • Fine-grained chunking combined with syllable tokens improves recall without requiring additional model capacity.
  • The same phonological structure supplies an inductive bias that remains effective even when the encoder is trained from scratch on modest Wikipedia data.
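The summary does not state the chunk size, overlap, or scoring model, so the following is only a generic sketch of what "fine-grained chunking plus Recall@5" usually means in retrieval pipelines; the size and stride values are illustrative assumptions, not the paper's settings:

```python
def chunk(tokens, size=64, stride=32):
    """Overlapping fixed-size chunks; stride < size gives the overlap
    that makes the chunking fine-grained (values are illustrative)."""
    chunks = []
    for i in range(0, len(tokens), stride):
        chunks.append(tokens[i:i + size])
        if i + size >= len(tokens):
            break
    return chunks

def recall_at_k(ranked_ids, gold_id, k=5):
    """1 if the gold chunk id appears among the top-k retrieved ids."""
    return int(gold_id in ranked_ids[:k])

def mean_recall_at_k(results, k=5):
    """results: one (ranked_ids, gold_id) pair per query."""
    return sum(recall_at_k(r, g, k) for r, g in results) / len(results)
```

The reported 50.3% Recall@5 is then simply `mean_recall_at_k` over all TQuAD queries, whatever encoder produces the ranking.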

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same six-pattern approach could be adapted to other languages whose syllable inventories are similarly constrained and regular.
  • Deployment costs for Turkish search or question-answering systems could drop because both vocabulary and model size stay small.
  • The result raises the possibility that explicit linguistic structure can offset the need for scale in other agglutinative languages.

Load-bearing premise

Turkish phonology consists of exactly six deterministic patterns that together produce a fixed set of about 8,000 syllable types covering every word that appears in real text.
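One way to see why the ~8,000 figure must come from attestation rather than pure combinatorics: the raw cross product of the six templates over Turkish's 8 vowels and 21 consonants over-generates by an order of magnitude. The alphabet strings and templates below are standard assumptions, not taken from the paper:

```python
from itertools import product

VOWELS = "aeıioöuü"                     # the 8 Turkish vowels
CONSONANTS = "bcçdfgğhjklmnprsştvyz"    # the 21 Turkish consonants
PATTERNS = ["V", "VC", "CV", "CVC", "VCC", "CVCC"]  # six syllable templates

def expand(pattern):
    """All strings matching one C/V template."""
    pools = [VOWELS if ch == "V" else CONSONANTS for ch in pattern]
    return {"".join(combo) for combo in product(*pools)}

all_forms = set().union(*(expand(p) for p in PATTERNS))
# The raw product yields 81,488 forms, an order of magnitude above
# ~8,000: real Turkish permits only a small set of final consonant
# clusters, so the attested syllable inventory is far smaller.
```

Real Turkish allows only a handful of final clusters (e.g. -rk, -st, -nt), so filtering to phonotactically legal or corpus-attested forms plausibly prunes this to the reported ~8,000 types.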

What would settle it

A large sample of contemporary Turkish text that contains words or sequences impossible to segment into the 8,000 defined syllable types, or an independent run of the TQuAD evaluation that yields Recall@5 at or below 46.92 percent.
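The first of these tests could be run mechanically: syllabify every word of a large held-out corpus and flag anything that falls outside the closed vocabulary. A generic sketch, where `syllabify` and `vocab` are whatever the replication supplies (nothing here is the authors' code):

```python
def coverage_report(words, syllabify, vocab):
    """Fraction of words whose syllables all fall inside the closed
    vocabulary, plus the violating words themselves."""
    violations = [w for w in words
                  if any(s not in vocab for s in syllabify(w))]
    return {
        "total": len(words),
        "violations": violations,
        "coverage": 1.0 - len(violations) / len(words) if words else 1.0,
    }
```

Any word in `violations` on diverse text (loanwords, proper names, code-switching) would directly challenge the OOV-free premise; an empty list would support it.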

read the original abstract

HeceTokenizer is a syllable-based tokenizer for Turkish that exploits the deterministic six-pattern phonological structure of the language to construct a closed, out-of-vocabulary (OOV)-free vocabulary of approximately 8,000 unique syllable types. A BERT-tiny encoder (1.5M parameters) is trained from scratch on a subset of Turkish Wikipedia using a masked language modeling objective and evaluated on the TQuAD retrieval benchmark using Recall@5. Combined with a fine-grained chunk-based retrieval strategy, HeceTokenizer achieves 50.3% Recall@5, surpassing the 46.92% reported by a morphology-driven baseline that uses a 200 times larger model. These results suggest that the phonological regularity of Turkish syllables provides a strong and resource-light inductive bias for retrieval tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces HeceTokenizer, a syllable-based tokenizer for Turkish that exploits the language's deterministic six-pattern phonological structure to construct a closed, OOV-free vocabulary of approximately 8,000 syllable types. A 1.5M-parameter BERT-tiny model is trained from scratch on a Turkish Wikipedia subset using masked language modeling and evaluated on the TQuAD retrieval benchmark. Combined with a fine-grained chunk-based retrieval strategy, it reports 50.3% Recall@5, outperforming a morphology-driven baseline that uses a model 200 times larger (46.92% Recall@5). The authors conclude that Turkish's phonological regularity supplies a strong, resource-light inductive bias for retrieval.

Significance. If the OOV-free property and experimental comparisons hold under scrutiny, the work would be significant for showing that language-specific phonological structure can yield compact, high-performing tokenizers and models, enabling small-scale systems to surpass much larger morphology-based approaches on Turkish retrieval. This could encourage similar linguistically grounded tokenization strategies in other languages with regular phonotactics and support efficiency-focused NLP research.

major comments (2)
  1. [Abstract] The central performance claim (50.3% Recall@5 with HeceTokenizer + chunking vs. 46.92% for the 200x-larger baseline) is presented without any description of data splits, pretraining corpus size or subset details, training procedure, hyper-parameters, or the precise chunking strategy. These omissions are load-bearing because they prevent assessment of whether the comparison is controlled and whether gains are attributable to the syllable tokenizer rather than unablated factors.
  2. [Abstract] The claim that the six phonological patterns produce a truly closed, OOV-free vocabulary of ~8,000 syllable types covering all Turkish text is asserted without evidence of coverage verification, exception handling, or analysis of real-world violations (loanwords, proper names, abbreviations, numbers, or code-switching). This assumption underpins the attribution of inductive bias and performance gains to phonological regularity; its unverified status weakens the central argument.
minor comments (1)
  1. [Abstract] The six patterns are referenced but never enumerated or defined; adding a brief explicit list (e.g., CV, CVC, etc.) would improve clarity for readers unfamiliar with Turkish phonology.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable feedback, which has helped us improve the clarity and rigor of our work. We address each major comment in detail below and have made revisions to the manuscript as indicated.

read point-by-point responses
  1. Referee: [Abstract] The central performance claim (50.3% Recall@5 with HeceTokenizer + chunking vs. 46.92% for the 200x-larger baseline) is presented without any description of data splits, pretraining corpus size or subset details, training procedure, hyper-parameters, or the precise chunking strategy. These omissions are load-bearing because they prevent assessment of whether the comparison is controlled and whether gains are attributable to the syllable tokenizer rather than unablated factors.

    Authors: We agree that the abstract, in its current concise form, omits key experimental details that would allow readers to fully assess the controlled nature of the comparison. The full manuscript provides these specifics in Sections 4 (pretraining on a Turkish Wikipedia subset with standard splits and MLM objective) and 5 (hyperparameters for the 1.5M-parameter model and the overlapping chunk-based retrieval strategy). To address the concern directly, we have revised the abstract to include a brief summary of the pretraining corpus, training procedure, and chunking approach, ensuring the performance claims are presented with necessary context. revision: yes

  2. Referee: [Abstract] The claim that the six phonological patterns produce a truly closed, OOV-free vocabulary of ~8,000 syllable types covering all Turkish text is asserted without evidence of coverage verification, exception handling, or analysis of real-world violations (loanwords, proper names, abbreviations, numbers, or code-switching). This assumption underpins the attribution of inductive bias and performance gains to phonological regularity; its unverified status weakens the central argument.

    Authors: We acknowledge that the initial manuscript asserts the closed vocabulary property based on the deterministic six-pattern phonological structure without providing explicit coverage statistics or exception analysis. This is a fair point that strengthens the need for supporting evidence. In the revised version, we have added a dedicated paragraph in Section 3 detailing the vocabulary construction process and a coverage verification study on both the pretraining corpus and a diverse held-out set that includes loanwords and code-switched text, along with a description of fallback handling for rare violations. This addition directly supports the inductive bias argument. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical tokenizer comparison with no derivations or self-referential reductions

full rationale

The paper contains no equations, parameter fittings, or derivation chains that could reduce a claimed result to its own inputs by construction. It describes a syllable tokenizer built from six phonological patterns, trains a small BERT-tiny model, and reports a direct Recall@5 comparison against an external morphology baseline on TQuAD. These steps are observational and falsifiable against held-out data; the OOV-free claim is a stated premise about Turkish phonology rather than a quantity derived from the reported metrics. No self-citations are invoked as load-bearing uniqueness theorems, and the performance numbers are not renamed fitted quantities. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that Turkish syllables follow exactly six deterministic phonological patterns that permit a fixed, exhaustive vocabulary.

axioms (1)
  • domain assumption: Turkish has a deterministic six-pattern phonological structure for syllables.
    Explicitly invoked in the abstract as the foundation for constructing the closed 8,000-syllable vocabulary.

pith-pipeline@v0.9.0 · 5422 in / 1313 out tokens · 45495 ms · 2026-05-10T16:30:53.863222+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith reviews without signing in.

Reference graph

Works this paper leans on

15 extracted references · 4 canonical work pages

  1. [1]

    Impact of tokenization on language models: An analysis for Turkish

    Cagri Toraman, Eyup Halit Yilmaz, Furkan Şahinuç, and Oguzhan Ozcelik. Impact of tokenization on language models: An analysis for Turkish. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(4):1–21, April 2023

  2. [2]

    Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian

    Batuhan Baykara and Tunga Güngör. Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian. Language Resources and Evaluation, 56(3):973–1007, September 2022

  3. [3]

    Tokens with Meaning: A Hybrid Tokenization Approach for Turkish

    M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, and Demircan Çelik. Tokens with Meaning: A Hybrid Tokenization Approach for Turkish. arXiv:2508.14292v3, March 2026

  4. [4]

    Hece Yapısı ve Satır Sonunda Kelimelerin Bölünmesi

    Türk Dil Kurumu. Hece Yapısı ve Satır Sonunda Kelimelerin Bölünmesi. https://tdk.gov.tr/icerik/yazim-kurallari/hece-yapisi-ve-satir-sonunda-kelimelerin-bolunmesi/, 2024

  5. [5]

    Heceleme Yöntemiyle Kök Sözcük Üretme

    İ. Büyükkuşcu and E. Adalı. Heceleme Yöntemiyle Kök Sözcük Üretme. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi, 2(1), 2016

  6. [6]

    Neural machine translation of rare words with subword units

    Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1715–1725, Berlin, Germany, August 2016

  7. [7]

    Japanese and korean voice search

    Mike Schuster and Kaisuke Nakajima. Japanese and korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152, March 2012

  8. [8]

    Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

    Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv:1804.10959, April 2018

  9. [9]

    Effect of tokenization granularity for Turkish large language models

    Yigit Bekir Kaya and A. Cüneyd Tantug. Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21:200335, March 2024

  10. [10]

    Superbizarre is not superb: Derivational morphology improves BERT’s interpretation of complex words

    Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze. Superbizarre is not superb: Derivational morphology improves BERT’s interpretation of complex words. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, pages 3594–3608, 2021

  11. [11]

    MorphPiece: A linguistic tokenizer for large language models

    Haris Jabbar. MorphPiece: A linguistic tokenizer for large language models. arXiv:2307.07262, February 2024

  12. [12]

    MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies

    Ehsaneddin Asgari, Yassine El Kheir, and Mohammad Ali Sadraei Javaheri. MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies. arXiv:2502.00894, February 2025

  13. [13]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of EMNLP 2018 System Demonstrations, pages 66–71, Brussels, Belgium, November 2018

  14. [14]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186, Minneapolis, Minnesota, June 2019

  15. [15]

    HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval

    Senol Gulgonul. HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval. GitHub repository, https://github.com/senolgulgonul/hecetokenizer, 2026