pith. machine review for the scientific record.

arxiv: 2604.22730 · v1 · submitted 2026-04-24 · 💻 cs.LG · cs.CL

Recognition: unknown

Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:16 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords Bantu languages · neural embeddings · cognate identification · Proto-Bantu reconstruction · morphological paradigms · cross-lingual similarity · transformer models · historical linguistics

The pith

Transformer models trained on modern Bantu morphological data recover lexical cognates that match established Proto-Bantu reconstructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether embeddings from neural models trained solely on contemporary Bantu paradigms can surface cross-lingual lexical patterns that line up with traditional historical reconstructions. Working with 14 Eastern and Southern Bantu languages, the authors extract noun and verb lemmas, identify shared candidates via cosine similarity over encoder embeddings, and compare the strongest matches against the Bantu Lexical Reconstructions database. Ten of the eleven highest-ranked noun candidates align with known Proto-Bantu forms, and twelve verb candidates do the same, with patterns replicated by a separate translation model and consistent with Guthrie zone groupings. The result matters because it suggests that abundant modern data can supply signals previously accessible only through painstaking comparative linguistics.

Core claim

We train a transformer on Bantu morphological paradigms from 14 languages, extract encoder embeddings for noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across five or more languages. Ten of the top eleven noun candidates (90.9 percent) match reconstructed Proto-Bantu forms in BLR3, including *-ntU 'person', *gombe 'cow', and *mUn, while twelve verb cognates align with known roots such as *-bon- 'see'. An independent model confirms the same clusters and recovers phylogenetic structure matching Guthrie classifications at p < 0.01, with all thirteen productive noun classes showing cosine similarity above 0.83 across languages.

What carries the argument

Encoder embeddings from the BantuMorph transformer model, compared via cosine similarity to cluster noun and verb lemmas into cross-lingual cognate candidates.
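The candidate-extraction step described above can be sketched in a few lines. The greedy clustering, the 0.85 threshold, and the data layout below are illustrative assumptions, not the authors' released pipeline (the referee notes the paper's exact threshold is unspecified).

```python
# Minimal sketch of cosine-similarity cognate candidate extraction,
# assuming per-language lemma embeddings. Threshold and clustering
# strategy are illustrative, not the paper's.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def cognate_candidates(lemma_embs, threshold=0.85, min_langs=5):
    """lemma_embs: {lang: {lemma: np.ndarray}}.
    Greedily group lemmas whose embeddings exceed the cosine threshold
    against every current member, keeping clusters attested in at
    least `min_langs` languages (the paper's 5+ criterion)."""
    clusters = []  # each cluster: list of (lang, lemma, unit vector)
    for lang, lemmas in lemma_embs.items():
        for lemma, emb in lemmas.items():
            u = normalize(emb)
            for cluster in clusters:
                if all(float(u @ m) >= threshold for _, _, m in cluster):
                    cluster.append((lang, lemma, u))
                    break
            else:
                clusters.append([(lang, lemma, u)])
    return [c for c in clusters if len({lg for lg, _, _ in c}) >= min_langs]
```

A single-linkage or average-linkage grouping would be an equally plausible reading; the pith does not say which the authors used.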

If this is right

  • Ninety percent of the strongest noun candidates align with documented Proto-Bantu reconstructions.
  • Twelve verb roots recovered from modern data match known Proto-Bantu forms and span wide geographic ranges.
  • An independent translation model reproduces the same cognate clusters and language groupings.
  • All thirteen productive noun classes maintain high cross-language similarity within class versus between classes.
  • Results are framed as recovery of shared Bantu structure consistent with Proto-Bantu rather than strict separation of retentions from later innovations.
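The noun-class bullet above amounts to a within-class versus between-class cosine comparison. A minimal sketch, with toy vectors and class labels standing in for the paper's data:

```python
# Within- vs. between-class mean cosine similarity gap; the embeddings
# and labels here are toy stand-ins, not the paper's noun-class data.
import numpy as np
from itertools import combinations

def class_similarity_gap(embs, labels):
    """Mean within-class minus mean between-class cosine similarity."""
    units = [v / np.linalg.norm(v) for v in embs]
    within, between = [], []
    for (i, u), (j, w) in combinations(enumerate(units), 2):
        (within if labels[i] == labels[j] else between).append(float(u @ w))
    return float(np.mean(within) - np.mean(between))
```

A positive gap on the real data is what the reported within-class > between-class result (p < 10^-9) would correspond to.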

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same embedding approach could be tested on other language families that have rich modern morphological resources but sparse historical records.
  • Incorporating explicit contact or borrowing signals might eventually separate ancient retentions from regional innovations.
  • The method offers a scalable way to propose new candidate cognates for linguist review in under-documented Bantu branches.
  • If extended to time-stamped data, it could help model the rate at which lexical similarity decays across Bantu subgroups.

Load-bearing premise

High cosine similarity between embeddings mainly reflects shared historical ancestry rather than recent language contact or coincidental overlap.

What would settle it

Finding that fewer than half of the top eleven noun candidates match any entry in the Bantu Lexical Reconstructions database, or that the recovered language clusters contradict established Guthrie zone classifications.

Figures

Figures reproduced from arXiv: 2604.22730 by Hillary Mutisya, John Mugane.

Figure 1
Figure 1: MDS projection of BantuMorph embedding distances, colored by Guthrie zone. Same-zone languages cluster together: E-zone (kik, kam, mer), J-zone (kin, run, lug), S-zone (zul, xho, sna, nso). The E-zone forms the tightest cluster (mean pairwise similarity 0.990); the J-zone languages (Kinyarwanda, Kirundi, Luganda) cluster in the center-top; and the S-zone languages (Zulu, Xhosa, Shona, N. Sotho) occupy the lower left. Same-zone cosine simila… view at source ↗
Figure 2
Figure 2: Ward-linkage dendrogram from BantuMorph embedding distances. Label colors indicate Guthrie zone (see… view at source ↗
Figure 3
Figure 3: Reference language family tree (Glottolog classification) for the 14 Bantu languages in our study, grouped by Guthrie zone. Our embedding-derived dendrogram (… view at source ↗
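A dendrogram like Figure 2 can be produced with standard hierarchical clustering. The language vectors below are synthetic placeholders standing in for mean per-language BantuMorph embeddings; only the zone codes come from Figure 1.

```python
# Sketch of Ward-linkage clustering over language-level embeddings,
# in the spirit of Figure 2. Vectors are synthetic placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

langs = ["kik", "kam", "mer", "kin", "run", "lug"]  # E- and J-zone codes from Figure 1
X = np.array([
    [1.00, 0.00], [0.98, 0.02], [0.99, 0.01],   # E-zone: near-identical toy vectors
    [0.00, 1.00], [0.02, 0.97], [0.01, 0.99],   # J-zone
])
# Ward linkage is derived for Euclidean distances, so pdist's default
# metric is used here; cosine distance would be a heuristic departure.
Z = linkage(pdist(X), method="ward")
zones = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two groups
```

Whether the paper computed Ward linkage over Euclidean or cosine distances is one of the replication details the referee asks the authors to specify.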
read the original abstract

We investigate whether neural models trained exclusively on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. Using BantuMorph v7, a transformer over Bantu morphological paradigms, we analyze 14 Eastern and Southern Bantu languages, extract encoder embeddings for their noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across 5+ languages. Evaluating these candidates against established historical resources (the Bantu Lexical Reconstructions database, BLR3, with 4,786 reconstructed Proto-Bantu forms, and the ASJP basic vocabulary), we confirm 10 of the top 11 noun candidates (90.9%) align with previously reconstructed Proto-Bantu forms, including *-ntU 'person' (8 languages), *gombe 'cow' (9 languages), and *mUn (9 languages). Extending to verbs, 12 verb cognates align with reconstructed Proto-Bantu roots, including *-bon- 'see' and *-jIm- 'stand', each attested across wide geographic ranges. Cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01). Cross-lingual noun class analysis reveals that all 13 productive classes maintain >0.83 cosine similarity across languages (within-class > between-class, p < 10^-9). Our dataset is restricted to Eastern and Southern Bantu, so we interpret these results as recovering shared Bantu lexical structure consistent with Proto-Bantu rather than definitively distinguishing Proto-Bantu retentions from later regional innovations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript claims that a transformer model (BantuMorph v7) trained solely on modern morphological paradigms from 14 Eastern and Southern Bantu languages can recover shared lexical structure via encoder embeddings and cosine-similarity-based cognate candidate extraction (728 nouns, 1,525 verbs across 5+ languages). Validation against the independent BLR3 database shows 10 of the top 11 noun candidates (90.9%) align with reconstructed Proto-Bantu forms (e.g., *-ntU, *gombe, *mUn), with 12 verb alignments (e.g., *-bon-, *-jIm-); cross-validation with NLLB-600M embeddings confirms phylogenetic groupings consistent with Guthrie zones (p < 0.01) and high within-class noun-class similarities (>0.83, p < 10^-9). The authors explicitly scope results to 'consistent with Proto-Bantu' rather than claiming isolation of retentions from innovations or contact effects.

Significance. If the pipeline holds, the work demonstrates a scalable neural approach to historical linguistics that leverages modern data and independent external resources (BLR3, ASJP, NLLB-600M) for validation, yielding falsifiable alignments and statistical phylogenetic tests. This could extend to other language families and provides a concrete example of embeddings capturing diachronic signals without ancient texts.

minor comments (3)
  1. §3.1: Provide the precise cosine-similarity threshold and any minimum-frequency filters used to arrive at the 728/1,525 candidate counts, along with a sensitivity check showing how the top-11 alignment rate changes under modest threshold variation.
  2. §4.2: The phylogenetic consistency test (p < 0.01) should specify the exact distance metric, clustering algorithm, and permutation procedure for the null distribution to allow direct replication.
  3. Table 1: Add a random-baseline column (e.g., expected matches under shuffled embeddings) so readers can gauge whether the 90.9% noun alignment exceeds chance.
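The random baseline the referee requests in comment 3 could be estimated by permutation: if embedding rank carried no signal, how many BLR3 matches would a random top-11 draw from the candidate pool contain? Everything below (the match predicate, the data layout) is a hypothetical stand-in, not the paper's evaluation code.

```python
# Hypothetical permutation baseline for referee comment 3: expected
# BLR3 matches in a random top-k draw from the candidate pool, i.e.
# the null where embedding similarity rank is uninformative.
import random

def chance_matches_top_k(candidates, matches_blr3, k=11, n_perm=5000, seed=0):
    """Monte Carlo estimate of E[matches in top k] under random ranking.
    `matches_blr3` is an assumed predicate: candidate -> bool."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_perm):
        sample = rng.sample(candidates, k)          # random "top k"
        total += sum(matches_blr3(c) for c in sample)
    return total / n_perm
```

With m matching entries among N candidates the estimate converges to the hypergeometric mean k·m/N, so the permutation is mainly a convenience; either figure would let readers gauge whether 10/11 exceeds chance.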

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate summary of the manuscript and for recommending minor revision. The report correctly captures our use of BantuMorph v7 embeddings on modern data to extract cognate candidates (728 nouns, 1,525 verbs), the high alignment rates with BLR3 reconstructions, cross-validation with NLLB-600M, and the explicit scoping to results 'consistent with Proto-Bantu' without claiming isolation of retentions from innovations or contact. No specific major comments appear in the provided report.

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external validation

full rationale

The paper trains a transformer (BantuMorph v7) exclusively on modern morphological paradigms from 14 Eastern/Southern Bantu languages, extracts encoder embeddings, and selects cognate candidates by cosine similarity thresholds across 5+ languages. These candidates are then matched post-hoc against independent external resources (BLR3 Proto-Bantu reconstructions, ASJP vocabulary) and cross-validated with a separate NLLB-600M model; phylogenetic groupings are tested for consistency with Guthrie zones (p<0.01). No equations or steps define the target historical forms in terms of the model's outputs, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations or author-specific uniqueness theorems. The paper explicitly scopes results to 'consistent with' reconstructions and flags inability to separate retentions from contact/innovation, keeping the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard NLP assumptions about what embeddings encode and linguistic assumptions about cognates; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Cosine similarity between encoder embeddings reflects historical cognate relatedness across languages
    Used to identify and rank cognate candidates from modern data
  • domain assumption Alignment with BLR3 and ASJP databases validates historical recovery
    Central to the confirmation of 10/11 top noun candidates

pith-pipeline@v0.9.0 · 5599 in / 1300 out tokens · 93251 ms · 2026-05-08T12:16:27.951643+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

3 extracted references

  1. Bantu Lexical Reconstructions 3 (BLR3). Royal Museum for Central Africa, Tervuren. Online: https://www.africamuseum.be/en/research/discover/human_sciences/culture_society/blr
  2. Johannes Bjerva and Isabelle Augenstein. Does Typological Information Transfer Cross-Lingually? In Proceedings of NAACL-HLT 2021.
  3. Ethan A. Chi, John Hewitt, and Christopher D. Manning. Finding Universal Grammatical Relations in Multilingual BERT. In Proceedings of ACL 2020, pages 5564–5577.