Phonological distances for linguistic typology and the origin of Indo-European languages
Pith reviewed 2026-05-10 15:52 UTC · model grok-4.3
The pith
Phoneme sequences modeled as second-order Markov chains yield distances that recover language families and correlate with geography to support a Steppe origin for Indo-European languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Phoneme sequences modeled as second-order Markov chains capture the statistical correlations of a phonological system; the resulting information-theoretic distances, augmented by articulatory features, recover major language families, reveal contact-induced convergence, and correlate with geographic distance in a manner consistent with the Steppe hypothesis for the Indo-European homeland.
What carries the argument
Second-order Markov chain modeling of phoneme sequences combined with an information-theoretic distance that incorporates articulatory features.
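To make the modeling step concrete, the following minimal sketch estimates second-order transition probabilities, P(next phoneme | previous two phonemes), by maximum likelihood from a single phoneme sequence. The function name `second_order_transitions` and the toy sequence are illustrative assumptions and are not taken from the paper, whose estimator and smoothing choices are not described on this page.

```python
from collections import Counter, defaultdict


def second_order_transitions(phonemes):
    """Maximum-likelihood estimate of P(next | previous two phonemes).

    Hypothetical helper for illustration; no smoothing is applied.
    """
    context_counts = Counter()
    trigram_counts = Counter()
    # Slide a window of three symbols over the sequence.
    for a, b, c in zip(phonemes, phonemes[1:], phonemes[2:]):
        context_counts[(a, b)] += 1
        trigram_counts[(a, b, c)] += 1
    probs = defaultdict(dict)
    for (a, b, c), n in trigram_counts.items():
        probs[(a, b)][c] = n / context_counts[(a, b)]
    return dict(probs)


# Toy sequence (assumption): letters stand in for phoneme symbols.
toy = list("banana")
model = second_order_transitions(toy)
for context, row in sorted(model.items()):
    print(context, row)
```

Each row of `model` is a probability distribution over the next phoneme given a two-phoneme context. A real corpus would typically need smoothing or a bias-corrected estimator for rare contexts.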
If this is right
- The distance matrix supplies a quantitative typology tool that classifies languages without relying on lexical data.
- Contact-induced convergence between languages becomes detectable as reduced phonological distance relative to family membership.
- Geographic correlation in the distance matrix directly constrains homeland locations for language families.
- The same pipeline can be applied to additional families to test or refine migration hypotheses.
Where Pith is reading between the lines
- Combining these distances with time-calibrated divergence models could yield rough estimates of when families split.
- The method offers an independent check on lexical or grammatical phylogenies that may be biased by borrowing.
- Extension to reconstructed proto-forms or ancient texts would test whether the geographic signal persists deeper in time.
Load-bearing premise
That modeling phoneme sequences as second-order Markov chains captures the essential statistical correlations of phonological systems well enough for the derived distances to reflect large-scale linguistic relatedness and geography.
What would settle it
A test set of languages in which the computed phonological distances fail to recover known families or show no correlation with geographic separation.
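One common way to check whether two distance matrices are correlated is a Mantel permutation test. The sketch below is a generic stdlib implementation run on synthetic data. This page does not say which statistical test the paper uses, so the choice of the Mantel test, the function names, and the toy matrices are all assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def upper_triangle(matrix, order):
    """Flatten the strict upper triangle after relabeling rows/columns."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]


def mantel_test(d_a, d_b, n_perm=999, seed=0):
    """Return (observed r, one-sided permutation p-value)."""
    n = len(d_a)
    identity = list(range(n))
    b_flat = upper_triangle(d_b, identity)
    r_obs = pearson(upper_triangle(d_a, identity), b_flat)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # jointly permute rows and columns of d_a
        if pearson(upper_triangle(d_a, perm), b_flat) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)


# Synthetic example (assumption): 10 "languages" at random locations,
# with "phonological" distance equal to geographic distance plus noise.
rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(10)]
geo = [[math.dist(p, q) for q in points] for p in points]
phon = [[0.0] * 10 for _ in range(10)]
for i in range(10):
    for j in range(i + 1, 10):
        phon[i][j] = phon[j][i] = geo[i][j] + rng.gauss(0, 0.05)

r, p = mantel_test(phon, geo)
print(f"r = {r:.3f}, p = {p:.3f}")
```

A falsifying outcome in the sense above would be a correlation indistinguishable from the permutation null, that is, a large p-value on real data.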
Original abstract
We show that short-range phoneme dependencies encode large-scale patterns of linguistic relatedness, with direct implications for quantitative typology and evolutionary linguistics. Specifically, using an information-theoretic framework, we argue that phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system. This finding enables us to quantify distances among 67 modern languages from a multilingual parallel corpus employing a distance metric that incorporates articulatory features of phonemes. The resulting phonological distance matrix recovers major language families and reveals signatures of contact-induced convergence. Remarkably, we obtain a clear correlation with geographic distance, allowing us to constrain a plausible homeland region for the Indo-European family, consistent with the Steppe hypothesis.
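A simple way to probe whether a given Markov order captures most of a sequence's statistical structure is to compare plug-in conditional entropies at increasing orders: once the entropy stops dropping, longer contexts add little predictive information. The sketch below illustrates this on a toy sequence. The function name, the toy data, and the use of a naive plug-in estimator are assumptions for illustration; the paper's own estimator is not described on this page.

```python
import math
from collections import Counter


def conditional_entropy(seq, order):
    """Plug-in estimate of H(X_t | previous `order` symbols), in bits."""
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(order, len(seq)):
        context = tuple(seq[i - order:i])
        context_counts[context] += 1
        joint_counts[(context, seq[i])] += 1
    total = sum(joint_counts.values())
    h = 0.0
    for (context, _symbol), n in joint_counts.items():
        # Weight by joint probability, take log of the conditional one.
        h -= (n / total) * math.log2(n / context_counts[context])
    return h


# Toy sequence (assumption): "aabb" repeated. One symbol of context
# does not determine the next symbol, but two symbols do.
seq = list("aabb" * 50)
entropies = {k: conditional_entropy(seq, k) for k in range(4)}
for k, h in entropies.items():
    print(f"order {k}: {h:.4f} bits")
```

In this toy case the entropy stays near 1 bit for orders 0 and 1 and drops to zero at order 2. On real corpora the plug-in estimate is biased downward at high orders because many contexts are observed only a few times, so any comparison across orders needs a bias-aware estimator or enough data.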
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an information-theoretic framework in which phoneme sequences from a multilingual parallel corpus are modeled as second-order Markov chains, and a distance metric incorporating articulatory features is used to define phonological distances among 67 languages. It claims these distances recover major language families, detect contact-induced convergence, exhibit a clear correlation with geographic distance, and thereby constrain a plausible homeland for the Indo-European family consistent with the Steppe hypothesis.
Significance. If the distances prove to reflect deep genetic relatedness rather than sampling or contact artifacts, the approach would supply a novel, corpus-driven quantitative tool for linguistic typology and evolutionary linguistics, offering independent evidence for family relationships and homelands. The parallel-corpus basis and feature incorporation are positive elements, but the absence of reported validation, robustness checks, or higher-order comparisons substantially reduces the immediate significance.
Major comments (2)
- [Abstract] The central assertion that 'phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system' is load-bearing for all downstream claims, yet it is presented without comparison to higher-order Markov models, syllable-level constraints, or long-range dependencies. If these omitted structures dominate family signals, the reported geographic correlation cannot be taken as evidence of deep relatedness.
- [Abstract, Results] The claim of a 'clear correlation with geographic distance', used to constrain the Indo-European homeland, is reported without controls for geographic sampling bias, recent contact effects, or alternative distance metrics. Without these, the Steppe-homeland inference rests on an unvalidated correlation whose robustness cannot be assessed.
Minor comments (2)
- [Abstract] The exact definition of the distance metric (including how articulatory features are combined with the Markov transition probabilities) should be stated explicitly with an equation rather than described only in prose.
- [Abstract] The manuscript should specify the parallel corpus, the precise set of 67 languages, and the geographic distance measure employed, as these details are essential for reproducibility and evaluation of the correlation.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We address each major comment below and indicate the revisions we will make.
- Referee: [Abstract] The central assertion that 'phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system' is load-bearing for all downstream claims, yet it is presented without comparison to higher-order Markov models, syllable-level constraints, or long-range dependencies. If these omitted structures dominate family signals, the reported geographic correlation cannot be taken as evidence of deep relatedness.
  Authors: We selected second-order Markov chains to model immediate phoneme dependencies while remaining computationally tractable with the available parallel corpus data. The empirical success of this model in recovering major language families and contact signatures provides indirect support for its adequacy. Nevertheless, we agree that explicit comparisons would strengthen the central claim. In the revised manuscript we will add a supplementary analysis comparing first-, second-, and third-order models on a representative subset of languages, showing that the second-order distance matrix yields the clearest family structure without the sparsity problems of higher orders.
  Revision: yes
- Referee: [Abstract, Results] The claim of a 'clear correlation with geographic distance', used to constrain the Indo-European homeland, is reported without controls for geographic sampling bias, recent contact effects, or alternative distance metrics. Without these, the Steppe-homeland inference rests on an unvalidated correlation whose robustness cannot be assessed.
  Authors: The reported geographic correlation is an observational result derived from the phonological distances. We acknowledge that the manuscript would benefit from explicit robustness checks. In revision we will add (i) partial Mantel tests controlling for language-family membership to mitigate contact and genetic effects, (ii) a discussion of geographic sampling balance across the 67 languages, and (iii) a comparison of our feature-augmented distance against a simple phoneme-edit-distance baseline. These additions will allow readers to evaluate the strength of the Steppe-homeland inference more rigorously.
  Revision: yes
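The phoneme-edit-distance baseline mentioned in the rebuttal could take many forms. One minimal sketch, assuming that each language provides phoneme transcriptions for the same aligned list of words, averages a length-normalized Levenshtein distance over word pairs. The function names, the normalization, and the toy transcriptions are assumptions for illustration and are not the authors' stated procedure.

```python
def levenshtein(a, b):
    """Edit distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Levenshtein distance divided by the longer length, in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def baseline_language_distance(words_a, words_b):
    """Mean normalized distance over aligned word pairs (assumed setup)."""
    pairs = list(zip(words_a, words_b))
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)


# Toy transcriptions (hypothetical, not from the paper's corpus).
lang_a = [["w", "a", "t", "e", "r"], ["n", "a", "m", "e"]]
lang_b = [["v", "a", "s", "e", "r"], ["n", "a", "m", "e"]]
d = baseline_language_distance(lang_a, lang_b)
print(d)
```

Unlike a feature-augmented metric, this baseline treats every substitution as equally costly, so comparing the two would show how much the articulatory features contribute.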
Circularity Check
No significant circularity; distances computed directly from corpus data
The derivation computes phonological distances from a parallel corpus of 67 languages by modeling phoneme sequences as second-order Markov chains and incorporating articulatory features. These distances are then observed to recover families, show contact effects, and correlate with external geographic data to constrain the Indo-European homeland. No equation or step reduces by construction to a fitted parameter, self-citation chain, or input that already encodes the target result. The Markov modeling choice and distance metric are applied uniformly to raw corpus data without post-hoc tuning to geography or known families, making the chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Markov chain order
axioms (2)
- Domain assumption: phoneme sequences can be modeled as second-order Markov chains that essentially capture the statistical correlations of a phonological system.
- Domain assumption: the resulting distance matrix incorporating articulatory features reflects true linguistic relatedness and geographic patterns.