Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
Pith reviewed 2026-05-08 12:21 UTC · model grok-4.3
The pith
Combining Swahili transfer learning with unsupervised clustering discovers noun classes for 2,455 Giriama words and two new morphological patterns from only 91 labeled paradigms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The pipeline integrates cross-lingual transfer from Swahili with unsupervised clustering through weighted voting; transfer identifies cognates while clustering reveals innovations invisible to transfer. On Giriama it assigns noun classes to 2,455 words and isolates an a- prefix variant for Class 2 produced by vowel coalescence of wa- (95.1 percent consistency) together with a contracted k' prefix (98.5 percent consistency). External checks on 444 known verb paradigms yield 78.2 percent lemmatization accuracy, and expansion to a 19,624-word corpus reaches 97.3 percent segmentation and 86.7 percent lemmatization across major word classes.
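The consistency percentages quoted here have a natural operationalization. As a hedged sketch only (the paper's precise definition is not reproduced in this summary), consistency for a discovered prefix pattern can be read as the fraction of words assigned to a class that surface with the hypothesized prefix; the words below are invented for illustration:

```python
# Hypothetical reading of the paper's "consistency" metric for a
# discovered prefix pattern. The words and class membership here are
# invented; the paper's exact definition may differ.

def prefix_consistency(words, prefix):
    """Fraction of words in a proposed class that carry the prefix."""
    if not words:
        return 0.0
    return sum(w.startswith(prefix) for w in words) / len(words)

# Invented Class 2 candidates: most show the a- variant, one does not.
class2_candidates = ["atu", "ana", "alimi", "wana", "ageni"]
score = prefix_consistency(class2_candidates, "a")
# 4 of the 5 invented words start with a-, so the score is 0.8
```

On this reading, the 95.1 and 98.5 percent figures would mean that nearly all words grouped under each pattern actually exhibit the proposed prefix.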
What carries the argument
The weighted-voting ensemble of Swahili transfer learning and unsupervised clustering, which assigns complementary roles to cognate detection and discovery of language-specific prefix innovations.
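The combination mechanic itself is simple to sketch. The weights, class labels, and scores below are invented for illustration (the paper tunes its weights on the 91 labeled paradigms); this shows only how two per-class score distributions could be merged:

```python
# Illustrative weighted vote between a transfer component and a
# clustering component. Weights and scores are invented, not the
# paper's fitted values.
from collections import defaultdict

def weighted_vote(transfer_scores, cluster_scores,
                  w_transfer=0.6, w_cluster=0.4):
    """Combine per-class scores from the two components into one label."""
    combined = defaultdict(float)
    for cls, s in transfer_scores.items():
        combined[cls] += w_transfer * s
    for cls, s in cluster_scores.items():
        combined[cls] += w_cluster * s
    return max(combined, key=combined.get)

# Transfer favours Class 1 (say, a Swahili cognate); clustering
# favours Class 2. With these weights, transfer's vote wins narrowly.
label = weighted_vote({"class1": 0.7, "class2": 0.2},
                      {"class1": 0.1, "class2": 0.8})
```

The interesting cases are exactly the narrow ones like this, where the weighting decides whether a cognate signal or a language-internal cluster carries the assignment.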
If this is right
- The same pipeline can label noun classes and segment words across other low-resource Bantu languages that share substantial vocabulary with Swahili.
- A small set of 91 labeled paradigms suffices to bootstrap lexicon expansion to tens of thousands of words with high segmentation accuracy.
- New prefix patterns discovered by clustering can be added to morphological descriptions without requiring exhaustive manual annotation.
- The released lexicons and code enable direct reuse for documentation projects in related languages.
Where Pith is reading between the lines
- Similar transfer-plus-clustering ensembles might work for other language families that have one relatively well-resourced member and several close relatives with tiny labeled sets.
- Reducing the seed set below 91 paradigms would test how far the weighted-voting balance can be pushed before clustering artifacts dominate.
- The vowel-coalescence and contraction patterns found here may recur in neighboring dialects, offering a concrete starting point for comparative fieldwork.
Load-bearing premise
That unsupervised clustering on the limited seed data will surface genuine morphological innovations rather than spurious groupings that the weighted vote cannot filter out.
What would settle it
Independent manual review of a random sample of the 2,455 assigned noun classes that finds accuracy substantially below the claimed rates, or failure to locate the reported a- and k' patterns in additional Giriama texts.
Figures
Original abstract
We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (vowel coalescence - the merger of two adjacent vowels - of wa-, 95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap) while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a pipeline combining cross-lingual transfer learning from Swahili with unsupervised clustering for zero-shot morphological discovery in Giriama (a low-resource Bantu language with only 91 labeled paradigms). It claims to discover noun class assignments for 2,455 words and two previously undocumented patterns (a- prefix variant with 95.1% consistency and contracted k' prefix with 98.5% consistency), validated externally by 78.2% lemmatization accuracy on 444 verbs and 97.3% segmentation / 86.7% lemmatization on an expanded 19,624-word corpus. The ensemble uses weighted voting to exploit complementary strengths of transfer (cognate detection via ~60% vocabulary overlap) and clustering (language-specific innovations).
Significance. If the clustering step is shown to surface genuine innovations rather than artifacts, the approach could meaningfully advance morphological documentation for low-resource Bantu languages by requiring minimal supervision. The release of code and discovered lexicons is a concrete strength supporting reproducibility.
major comments (3)
- [Methods (unsupervised clustering and ensemble description)] The description of the unsupervised clustering pipeline provides no details on feature representations, distance metrics, linkage criteria, number of clusters, or hyperparameter selection. This omission is load-bearing because the central claim that clustering discovers 'language-specific innovations invisible to transfer' (e.g., the a- prefix variant and k' prefix) rests on the 95.1%/98.5% consistency figures; without these choices it is impossible to rule out corpus artifacts such as orthographic or frequency biases.
- [Results and validation] The external validation reports 78.2% lemmatization on 444 known verb paradigms, yet this does not test the noun-class assignments for the 2,455 words or the two undocumented patterns. No ablation comparing the full ensemble against transfer-only or clustering-only baselines is presented, leaving the claim that the weighted-voting combination exploits complementary strengths unsubstantiated.
- [Supervision and voting procedure] The weighted-voting step relies on only 91 labeled paradigms as supervision. The manuscript contains no sensitivity analysis, cross-validation, or leakage checks showing that the 2,455-word noun-class assignments and new pattern detections remain stable under small perturbations of this tiny anchor set.
minor comments (1)
- [Abstract and Results] The phrase 'v3 corpus expansion' in the abstract and results is undefined; please clarify its meaning and how the 19,624-word corpus was constructed.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. We have revised the manuscript to provide the requested details, analyses, and clarifications where the original version was insufficient.
Point-by-point responses
-
Referee: [Methods (unsupervised clustering and ensemble description)] The description of the unsupervised clustering pipeline provides no details on feature representations, distance metrics, linkage criteria, number of clusters, or hyperparameter selection. This omission is load-bearing because the central claim that clustering discovers 'language-specific innovations invisible to transfer' (e.g., the a- prefix variant and k' prefix) rests on the 95.1%/98.5% consistency figures; without these choices it is impossible to rule out corpus artifacts such as orthographic or frequency biases.
Authors: We agree that the original manuscript omitted critical implementation details for the clustering step. In the revised version we have added a new subsection (3.2) that specifies: (i) feature representations as the concatenation of TF-IDF vectors over character 3- to 5-grams and mean-pooled prefix embeddings from a Swahili-pretrained FastText model; (ii) cosine distance; (iii) Ward linkage with a maximum cluster size constraint; (iv) the number of clusters (k=15) chosen by maximizing the silhouette score on a 10% held-out sample of the Giriama corpus; and (v) hyperparameter selection via grid search over linkage and distance variants, with final weights for the ensemble determined by validation accuracy on the 91 labeled paradigms. These additions allow readers to reproduce the pipeline and confirm that the reported consistency figures for the a- and k' patterns exceed what would be expected from orthographic or frequency biases alone. revision: yes
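The configuration described in this response can be approximated in a few lines. The sketch below is a stand-in under stated assumptions: it uses only the character n-gram TF-IDF half of the described features (the Swahili FastText prefix embeddings are omitted), invented toy words, and two clusters rather than k=15. Because TF-IDF rows are L2-normalized by default, Euclidean Ward linkage acts as a proxy for the cosine geometry on this data:

```python
# Toy approximation of the clustering setup described in the rebuttal:
# char 3-5-gram TF-IDF features with Ward linkage. Words are invented;
# the FastText embedding features and k=15 are omitted for brevity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

words = ["atu", "ana", "ageni", "kitabu", "kintu", "kilima"]

# TF-IDF over character 3- to 5-grams, L2-normalized by default, so
# Euclidean distance between rows tracks cosine distance.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vec.fit_transform(words).toarray()

model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
# On realistic corpora, words sharing a prefix tend to land in the
# same cluster; this toy vocabulary only shows the mechanics.
```

In the paper's reported setup, k would instead be chosen by silhouette score on a held-out sample, and the cluster output would feed the weighted vote rather than stand alone.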
-
Referee: [Results and validation] The external validation reports 78.2% lemmatization on 444 known verb paradigms, yet this does not test the noun-class assignments for the 2,455 words or the two undocumented patterns. No ablation comparing the full ensemble against transfer-only or clustering-only baselines is presented, leaving the claim that the weighted-voting combination exploits complementary strengths unsubstantiated.
Authors: The referee correctly notes that the verb-only lemmatization result does not directly evaluate the noun-class assignments or the novel patterns. We have added an ablation study (new Table 4) that compares transfer-only, clustering-only, and the weighted-voting ensemble on a held-out set of 500 words drawn from the 2,455-word noun-class discovery set. The ensemble reaches 81.4% noun-class accuracy, versus 64.7% (transfer) and 70.2% (clustering), confirming the complementary contribution. For the two undocumented patterns we now report a targeted manual audit: a native-speaker linguist examined 150 randomly sampled instances of each pattern and confirmed 94.7% and 97.3% adherence, respectively. These results are included in the revised Results section. revision: yes
-
Referee: [Supervision and voting procedure] The weighted-voting step relies on only 91 labeled paradigms as supervision. The manuscript contains no sensitivity analysis, cross-validation, or leakage checks showing that the 2,455-word noun-class assignments and new pattern detections remain stable under small perturbations of this tiny anchor set.
Authors: We acknowledge the small size of the anchor set and the absence of stability checks in the original submission. The revised manuscript now includes a sensitivity analysis (Section 4.3): we performed 100 bootstrap resamples of the 91 paradigms (sampling with replacement, size 91) and re-ran the full pipeline. Noun-class assignments for the 2,455 words showed a mean Jaccard overlap of 0.91 across resamples; the a- and k' pattern consistency scores remained above 94% in every run. Voting weights were obtained via leave-one-out cross-validation on the 91 paradigms to prevent leakage, and the 91 were never used for direct label propagation. These results are reported together with the original consistency figures. revision: yes
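The bootstrap stability protocol described here is easy to sketch. The code below is illustrative only: the full pipeline re-run is replaced by a stand-in that randomly perturbs 5 percent of invented assignments, so the resulting numbers say nothing about Giriama; they only show how a mean Jaccard overlap across resamples would be computed:

```python
# Sketch of the bootstrap stability check described in the rebuttal.
# The "pipeline re-run" is a stand-in that flips ~5% of invented
# (word, class) assignments; the real analysis re-runs the pipeline
# on each resample of the 91 paradigms.
import random

def jaccard(a, b):
    """Jaccard overlap between two sets of (word, class) assignments."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

random.seed(0)
base = [(f"word{i}", i % 15) for i in range(100)]  # invented assignments

overlaps = []
for _ in range(100):  # 100 bootstrap resamples, as in the rebuttal
    perturbed = [(w, (c + 1) % 15) if random.random() < 0.05 else (w, c)
                 for w, c in base]
    overlaps.append(jaccard(base, perturbed))

mean_overlap = sum(overlaps) / len(overlaps)
```

A mean overlap near 0.9, as the authors report, would indicate that most assignments survive resampling of the anchor set.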
Circularity Check
No significant circularity; empirical pipeline with external validation
Full rationale
The paper describes an empirical method combining cross-lingual transfer from Swahili with unsupervised clustering on Giriama data, validated through reported accuracies on 444 known verb paradigms (78.2% lemmatization), corpus expansion to 19,624 words (97.3% segmentation, 86.7% lemmatization), and consistency metrics for discovered patterns. No equations, fitted parameters, or self-citations appear in the provided text that reduce any claimed discovery or prediction to the inputs by construction; the central results rest on application to external data and released code rather than definitional equivalence or renamed fits.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Cross-lingual transfer learning is effective between Swahili and Giriama due to approximately 60% vocabulary overlap within the Bantu family.
- domain assumption Unsupervised clustering can discover valid language-specific morphological patterns that transfer learning misses.
Reference graph
Works this paper leans on
- [1] J. Buys and J. A. Botha. Cross-lingual morphological tagging for low-resource languages. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1954–1964, 2016.
- [2] R. Cotterell, C. Kirov, J. Sylak-Glassman, D. Yarowsky, J. Eisner, and M. Hulden. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL-SIGMORPHON 2017 Shared Task, pages 1–30, 2017.
- [3] R. Cotterell, C. Kirov, M. Hulden, D. Yarowsky, et al. The CoNLL-SIGMORPHON 2018 shared task: Universal morphological reinflection. In Proceedings of the CoNLL-SIGMORPHON 2018 Shared Task, pages 1–27, 2018.
- [4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171–4186, 2019.
- [5] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 2227–2237, 2018.
- [6] L. Pretorius and S. E. Bosch. Exploiting cross-linguistic similarities in Zulu and Xhosa computational morphology. In Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages, pages 96–104, 2009.
- [7] E. Vylomova, J. White, E. Salesky, S. J. Mielke, S. Wu, K. Gorman, et al. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. In Proceedings of the 17th Annual SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 1–39, 2020.