Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
Pith reviewed 2026-05-08 12:16 UTC · model grok-4.3
The pith
Transformer models trained on modern Bantu morphological data recover lexical cognates that match established Proto-Bantu reconstructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We train a transformer on Bantu morphological paradigms from 14 languages, extract encoder embeddings for noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across five or more languages. Ten of the top eleven noun candidates (90.9 percent) match reconstructed Proto-Bantu forms in BLR3, including *-ntU 'person', *gombe 'cow', and *mUn, while twelve verb cognates align with known roots such as *-bon- 'see'. An independent model confirms the same clusters and recovers phylogenetic structure matching Guthrie classifications at p < 0.01, with all thirteen productive noun classes showing cosine similarity above 0.83 across languages.
What carries the argument
Encoder embeddings from the BantuMorph transformer model, compared via cosine similarity to cluster noun and verb lemmas into cross-lingual cognate candidates.
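The extraction step can be pictured as thresholded cosine-similarity grouping over per-lemma embeddings. A minimal sketch follows; the function names, the 0.8 threshold, and the greedy single-anchor clustering are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cognate_candidates(embeddings, threshold=0.8, min_languages=5):
    """Greedily group lemmas whose embeddings are similar across languages.

    embeddings: dict mapping (language, lemma) -> 1-D numpy vector.
    Returns clusters (lists of (language, lemma) keys) that span at least
    min_languages distinct languages, mirroring the paper's "5+ languages"
    criterion.
    """
    keys = list(embeddings)
    assigned = set()
    clusters = []
    for i, anchor in enumerate(keys):
        if anchor in assigned:
            continue
        cluster = [anchor]
        for other in keys[i + 1:]:
            if other in assigned or other[0] == anchor[0]:
                continue  # skip already-clustered or same-language lemmas
            if cosine(embeddings[anchor], embeddings[other]) >= threshold:
                cluster.append(other)
        if len({lang for lang, _ in cluster}) >= min_languages:
            clusters.append(cluster)
            assigned.update(cluster)
    return clusters
```

Greedy anchoring is only one of several plausible clustering choices; the paper does not say whether it uses greedy grouping, graph components, or agglomerative clustering, and the candidate counts (728 nouns, 1,525 verbs) would shift under each.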
If this is right
- Ninety percent of the strongest noun candidates align with documented Proto-Bantu reconstructions.
- Twelve verb roots recovered from modern data match known Proto-Bantu forms and span wide geographic ranges.
- An independent translation model reproduces the same cognate clusters and language groupings.
- All thirteen productive noun classes maintain high cross-language similarity within class versus between classes.
- Results are framed as recovery of shared Bantu structure consistent with Proto-Bantu rather than strict separation of retentions from later innovations.
Where Pith is reading between the lines
- The same embedding approach could be tested on other language families that have rich modern morphological resources but sparse historical records.
- Incorporating explicit contact or borrowing signals might eventually separate ancient retentions from regional innovations.
- The method offers a scalable way to propose new candidate cognates for linguist review in under-documented Bantu branches.
- If extended to time-stamped data, it could help model the rate at which lexical similarity decays across Bantu subgroups.
Load-bearing premise
High cosine similarity between embeddings mainly reflects shared historical ancestry rather than recent language contact or coincidental overlap.
What would settle it
Finding that fewer than half of the top eleven noun candidates match any entry in the Bantu Lexical Reconstructions database, or that the recovered language clusters contradict established Guthrie zone classifications.
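That criterion is ultimately a counting check: how many of the top candidates land near some entry in the reconstruction database. A toy sketch, using naive edit-distance matching in place of whatever phonological normalization the authors actually apply (function names and the distance threshold are hypothetical):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def blr3_match_rate(candidates, reconstructions, max_dist=1):
    """Fraction of candidate forms within max_dist edits of any reconstructed root."""
    hits = sum(
        any(levenshtein(c, r) <= max_dist for r in reconstructions)
        for c in candidates
    )
    return hits / len(candidates)
```

On this view, the paper's claim is that the analogue of `blr3_match_rate` over the top eleven noun candidates comes out at 10/11; the settling condition above is that a faithful re-run puts it below 6/11.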
Original abstract
We investigate whether neural models trained exclusively on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. Using BantuMorph v7, a transformer over Bantu morphological paradigms, we analyze 14 Eastern and Southern Bantu languages, extract encoder embeddings for their noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across 5+ languages. Evaluating these candidates against established historical resources (the Bantu Lexical Reconstructions database, BLR3, with 4,786 reconstructed Proto-Bantu forms, and the ASJP basic vocabulary), we confirm 10 of the top 11 noun candidates (90.9%) align with previously reconstructed Proto-Bantu forms, including *-ntU 'person' (8 languages), *gombe 'cow' (9 languages), and *mUn (9 languages). Extending to verbs, 12 verb cognates align with reconstructed Proto-Bantu roots, including *-bon- 'see' and *-jIm- 'stand', each attested across wide geographic ranges. Cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01). Cross-lingual noun class analysis reveals that all 13 productive classes maintain >0.83 cosine similarity across languages (within-class > between-class, p < 10^-9). Our dataset is restricted to Eastern and Southern Bantu, so we interpret these results as recovering shared Bantu lexical structure consistent with Proto-Bantu rather than definitively distinguishing Proto-Bantu retentions from later regional innovations.
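The within-class versus between-class comparison in the abstract reduces to two pools of pairwise cosine similarities, one for lemma pairs sharing a noun class and one for pairs that do not. A toy sketch with hypothetical names; how the paper pools pairs across its 14 languages is not specified here:

```python
import numpy as np
from itertools import combinations

def class_similarity_gap(embeddings, classes):
    """Mean cosine similarity within vs. between noun classes.

    embeddings: dict mapping lemma -> 1-D numpy vector.
    classes:    dict mapping lemma -> noun-class id.
    Returns (mean within-class similarity, mean between-class similarity).
    """
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    within, between = [], []
    for a, b in combinations(embeddings, 2):
        sim = cos(embeddings[a], embeddings[b])
        (within if classes[a] == classes[b] else between).append(sim)
    return float(np.mean(within)), float(np.mean(between))
```

The abstract's claim corresponds to the first returned value exceeding 0.83 for every one of the 13 productive classes, with the within/between gap significant at p < 10^-9 under whatever test the paper applies.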
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a transformer model (BantuMorph v7) trained solely on modern morphological paradigms from 14 Eastern and Southern Bantu languages can recover shared lexical structure via encoder embeddings and cosine-similarity-based cognate candidate extraction (728 nouns, 1,525 verbs across 5+ languages). Validation against the independent BLR3 database shows 10 of the top 11 noun candidates (90.9%) align with reconstructed Proto-Bantu forms (e.g., *-ntU, *gombe, *mUn), with 12 verb alignments (e.g., *-bon-, *-jIm-); cross-validation with NLLB-600M embeddings confirms phylogenetic groupings consistent with Guthrie zones (p < 0.01) and high within-class noun-class similarities (>0.83, p < 10^-9). The authors explicitly scope results to 'consistent with Proto-Bantu' rather than claiming isolation of retentions from innovations or contact effects.
Significance. If the pipeline holds, the work demonstrates a scalable neural approach to historical linguistics that leverages modern data and independent external resources (BLR3, ASJP, NLLB-600M) for validation, yielding falsifiable alignments and statistical phylogenetic tests. This could extend to other language families and provides a concrete example of embeddings capturing diachronic signals without ancient texts.
minor comments (3)
- §3.1: Provide the precise cosine-similarity threshold and any minimum-frequency filters used to arrive at the 728/1,525 candidate counts, along with a sensitivity check showing how the top-11 alignment rate changes under modest threshold variation.
- §4.2: The phylogenetic consistency test (p < 0.01) should specify the exact distance metric, clustering algorithm, and permutation procedure for the null distribution to allow direct replication.
- Table 1: Add a random-baseline column (e.g., expected matches under shuffled embeddings) so readers can gauge whether the 90.9% noun alignment exceeds chance.
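The second and third comments both amount to asking for an explicit null distribution. A generic label-shuffling permutation test sketches the shape such a procedure could take; the helper name and the choice of test statistic are illustrative only, since the paper's actual distance metric and clustering algorithm are not specified in the review:

```python
import numpy as np

def permutation_p_value(statistic, labels, features, n_perm=1000, seed=0):
    """One-sided permutation p-value for an observed statistic.

    statistic: callable(labels, features) -> float, where larger values
               mean more label/feature structure (e.g. Guthrie-zone
               coherence of embedding clusters).
    Shuffling the labels breaks any association with the features,
    which yields the null distribution the referee asks for.
    """
    rng = np.random.default_rng(seed)
    observed = statistic(labels, features)
    labels = np.asarray(labels)
    count = 0
    for _ in range(n_perm):
        if statistic(rng.permutation(labels), features) >= observed:
            count += 1
    # Add-one smoothing so the p-value is never exactly zero.
    return (count + 1) / (n_perm + 1)
```

The same machinery covers the Table 1 request: shuffling embedding-to-lemma assignments and recomputing the top-11 alignment rate gives the chance baseline directly.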
Simulated Author's Rebuttal
We thank the referee for their accurate summary of the manuscript and for recommending minor revision. The report correctly captures our use of BantuMorph v7 embeddings on modern data to extract cognate candidates (728 nouns, 1,525 verbs), the high alignment rates with BLR3 reconstructions, cross-validation with NLLB-600M, and the explicit scoping to results 'consistent with Proto-Bantu' without claiming isolation of retentions from innovations or contact. No specific major comments appear in the provided report.
Circularity Check
No significant circularity; derivation relies on external validation
full rationale
The paper trains a transformer (BantuMorph v7) exclusively on modern morphological paradigms from 14 Eastern/Southern Bantu languages, extracts encoder embeddings, and selects cognate candidates by cosine similarity thresholds across 5+ languages. These candidates are then matched post-hoc against independent external resources (BLR3 Proto-Bantu reconstructions, ASJP vocabulary) and cross-validated with a separate NLLB-600M model; phylogenetic groupings are tested for consistency with Guthrie zones (p < 0.01). No equations or steps define the target historical forms in terms of the model's outputs, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations or author-specific uniqueness theorems. The paper explicitly scopes results to 'consistent with' reconstructions and flags its inability to separate retentions from contact or innovation, keeping the derivation chain anchored to external benchmarks rather than to the model's own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Cosine similarity between encoder embeddings reflects historical cognate relatedness across languages
- domain assumption Alignment with BLR3 and ASJP databases validates historical recovery
Reference graph
Works this paper leans on
- Bantu Lexical Reconstructions (BLR3). Royal Museum for Central Africa, Tervuren. Online: https://www.africamuseum.be/en/research/discover/human_sciences/culture_society/blr
- [1] Johannes Bjerva and Isabelle Augenstein. Does Typological Information Transfer Cross-Lingually? In Proceedings of NAACL-HLT 2021, 2021.
- [2] Ethan A. Chi, John Hewitt, and Christopher D. Manning. Finding Universal Grammatical Relations in Multilingual BERT. In Proceedings of ACL 2020, pages 5564–5577, 2020.