Phonological distances for linguistic typology and the origin of Indo-European languages

David Sanchez; Juan De Gregorio; Marius Mavridis; Raul Toral

arXiv: 2604.11565 · v1 · submitted 2026-04-13 · cs.CL · cond-mat.stat-mech · cs.IT · math.IT · physics.soc-ph


Pith reviewed 2026-05-10 15:52 UTC · model grok-4.3


keywords: phonological distances · linguistic typology · Indo-European languages · Markov chains · information theory · language families · Steppe hypothesis · geographic correlation


The pith

Phoneme sequences modeled as second-order Markov chains yield distances that recover language families and correlate with geography to support a Steppe origin for Indo-European languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that short-range phoneme dependencies, captured by treating sequences as second-order Markov chains, encode large-scale patterns of linguistic relatedness across families. An information-theoretic distance metric that folds in articulatory features of phonemes then produces a matrix for 67 languages from parallel text. This matrix recovers major families, detects contact convergence, and shows a clear correlation with physical distance, which in turn constrains the plausible homeland of the Indo-European family to the steppe region.

Core claim

Phoneme sequences modeled as second-order Markov chains capture the statistical correlations of a phonological system; the resulting information-theoretic distances, augmented by articulatory features, recover major language families, reveal contact-induced convergence, and correlate with geographic distance in a manner consistent with the Steppe hypothesis for the Indo-European homeland.

What carries the argument

Second-order Markov chain modeling of phoneme sequences combined with an information-theoretic distance that incorporates articulatory features.
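
As a rough illustration of the modeling step, the sketch below estimates second-order transition probabilities from trigram counts by maximum likelihood. The function name `second_order_transitions` and the toy sequence are hypothetical and are not taken from the paper or its code; the paper's actual estimator may differ.

```python
from collections import Counter, defaultdict


def second_order_transitions(phonemes):
    """Maximum-likelihood estimate of P(next | prev2, prev1) from trigram counts.

    Illustrative helper only (an assumption, not the authors' implementation).
    """
    trigram_counts = Counter(zip(phonemes, phonemes[1:], phonemes[2:]))

    # Total number of times each two-phoneme context was followed by anything.
    context_totals = Counter()
    for (a, b, _), n in trigram_counts.items():
        context_totals[(a, b)] += n

    # Normalize each trigram count by its context total.
    probs = defaultdict(dict)
    for (a, b, c), n in trigram_counts.items():
        probs[(a, b)][c] = n / context_totals[(a, b)]
    return dict(probs)


# Toy symbol sequence standing in for a phoneme transcription (hypothetical data).
toy = list("banana")
model = second_order_transitions(toy)
```

Each key of `model` is a two-phoneme context, and the values for one context sum to 1. On real transcriptions many contexts are rare, so some form of smoothing or bias correction would likely be needed.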

If this is right

The distance matrix supplies a quantitative typology tool that classifies languages without relying on lexical data.
Contact-induced convergence between languages becomes detectable as reduced phonological distance relative to family membership.
Geographic correlation in the distance matrix directly constrains homeland locations for language families.
The same pipeline can be applied to additional families to test or refine migration hypotheses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Combining these distances with time-calibrated divergence models could yield rough estimates of when families split.
The method offers an independent check on lexical or grammatical phylogenies that may be biased by borrowing.
Extension to reconstructed proto-forms or ancient texts would test whether the geographic signal persists deeper in time.

Load-bearing premise

That modeling phoneme sequences as second-order Markov chains captures the essential statistical correlations of phonological systems well enough for the derived distances to reflect large-scale linguistic relatedness and geography.
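
In symbols, the premise can be written as follows. This is a generic statement of the second-order Markov property, with $x_t$ denoting the phoneme at position $t$; the notation is ours and may not match the paper's.

```latex
P(x_t \mid x_{t-1}, x_{t-2}, \dots, x_1) \;=\; P(x_t \mid x_{t-1}, x_{t-2})
```

If the premise holds, conditioning on more than two preceding phonemes does not change the predictive distribution of the next one.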

What would settle it

A test set of languages in which the computed phonological distances fail to recover known families or show no correlation with geographic separation.

Figures

Figures reproduced from arXiv: 2604.11565 by David Sanchez, Juan De Gregorio, Marius Mavridis, Raul Toral.

**Figure 2.** Left: Predictability gain for English phonological classes as a function of the block size.

**Figure 3.** Left: Distributions of English phoneme trigrams. The five most frequent 3-phones are ...

**Figure 4.** Heatmap representation of the phonological distances between all pairs of languages in our ...

**Figure 5.** Same heatmap as Fig. 4 but showing only Indo-European languages. The ...

**Figure 6.** Wasserstein distance between the 3-phone probability distributions of (left) all languages ...

**Figure 7.** Heatmap for the sum of squared residuals

**Figure 8.** Comparison of four different distances between the feature vector representations of selected ...

Original abstract

We show that short-range phoneme dependencies encode large-scale patterns of linguistic relatedness, with direct implications for quantitative typology and evolutionary linguistics. Specifically, using an information-theoretic framework, we argue that phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system. This finding enables us to quantify distances among 67 modern languages from a multilingual parallel corpus employing a distance metric that incorporates articulatory features of phonemes. The resulting phonological distance matrix recovers major language families and reveals signatures of contact-induced convergence. Remarkably, we obtain a clear correlation with geographic distance, allowing us to constrain a plausible homeland region for the Indo-European family, consistent with the Steppe hypothesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Phonological distances from second-order Markov chains recover families and show a geographic correlation for Indo-European, but the order-2 assumption needs checking against higher dependencies.


The core result is that distances computed from second-order Markov models of phoneme sequences, augmented with articulatory features and estimated from a 67-language parallel corpus, recover major families and produce a clear geographic correlation that the authors use to support a Steppe homeland for Indo-European languages. That combination of family recovery and homeland constraint is the main takeaway.

The parallel corpus is a practical strength because it reduces content-based noise in the phoneme counts. Incorporating articulatory features into the distance metric adds phonetic grounding that pure sequence statistics often lack. Recovering known families at least shows that the distances are not random noise. The geographic correlation is then presented as evidence for the homeland inference, which is a direct move into evolutionary linguistics.

The soft spot is the modeling assumption. A second-order Markov chain conditions each phoneme only on its two immediate predecessors, yet phonological systems include syllable structure, harmony, and longer prosodic patterns that could carry more of the family signal. If those higher-order statistics dominate, the observed distances and the geographic link might partly reflect contact or corpus sampling rather than deep relatedness. The abstract frames the order-2 model as essentially sufficient, but that claim needs explicit sensitivity checks or comparisons with higher orders to hold weight. The homeland conclusion also depends on how cleanly the correlation separates relatedness from geography-driven contact.

This work sits at the intersection of computational typology and historical linguistics. Readers already building or testing distance measures for language families would get the most immediate value, since they could adapt the metric and run their own controls.

It deserves peer review because the quantitative framing is concrete enough for referees to evaluate the distance definition, the correlation statistics, and the robustness of the family recovery. Reviewers can then decide whether the Steppe alignment survives the necessary checks on the Markov order and potential confounds.
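
One way to picture the sensitivity check on the Markov order is to track how the conditional entropy of the next phoneme falls as the conditioning context grows. The sketch below uses a naive plug-in estimator on a toy sequence; the function name and the data are hypothetical, and plug-in estimates are known to be biased on small samples, so this should not be read as the paper's procedure.

```python
import math
from collections import Counter


def conditional_entropy(seq, order):
    """Plug-in estimate (bits) of H(next symbol | previous `order` symbols).

    Illustrative helper (an assumption, not the authors' estimator).
    """
    # Counts of blocks made of `order` context symbols plus the next symbol.
    joint = Counter(tuple(seq[i:i + order + 1]) for i in range(len(seq) - order))

    # Marginal counts of each context, derived from the same blocks.
    context = Counter()
    for block, c in joint.items():
        context[block[:-1]] += c

    total = sum(joint.values())
    return -sum(
        c / total * math.log2(c / context[block[:-1]])
        for block, c in joint.items()
    )


# Toy periodic sequence (hypothetical): one symbol of context already
# determines the next symbol, so the entropy drops to zero at order 1.
toy = list("abc" * 6)
entropies = [conditional_entropy(toy, k) for k in range(4)]
```

If the curve for real phoneme data flattens at order 2, that would be consistent with the paper's premise; a continued drop at order 3 or beyond would point to dependencies the model misses.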

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an information-theoretic framework in which phoneme sequences from a multilingual parallel corpus are modeled as second-order Markov chains (incorporating articulatory features) to define phonological distances among 67 languages. It claims these distances recover major language families, detect contact-induced convergence, exhibit a clear correlation with geographic distance, and thereby constrain a plausible homeland for the Indo-European family consistent with the Steppe hypothesis.

Significance. If the distances prove to reflect deep genetic relatedness rather than sampling or contact artifacts, the approach would supply a novel, corpus-driven quantitative tool for linguistic typology and evolutionary linguistics, offering independent evidence for family relationships and homelands. The parallel-corpus basis and feature incorporation are positive elements, but the absence of reported validation, robustness checks, or higher-order comparisons substantially reduces the immediate significance.

major comments (2)

[Abstract] The central assertion that 'phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system' is load-bearing for all downstream claims yet is presented without comparison to higher-order Markov models, syllable-level constraints, or long-range dependencies; if these omitted structures dominate family signals, the reported geographic correlation cannot be taken as evidence of deep relatedness.
[Abstract, Results] The claim of a 'clear correlation with geographic distance' used to constrain the Indo-European homeland lacks any reported controls for geographic sampling bias, recent contact effects, or alternative distance metrics; without these, the Steppe-homeland inference rests on an unvalidated correlation whose robustness cannot be assessed.

minor comments (2)

[Abstract] The exact definition of the distance metric (including how articulatory features are combined with the Markov transition probabilities) should be stated explicitly with an equation rather than described only in prose.
[Abstract] The manuscript should specify the parallel corpus, the precise set of 67 languages, and the geographic distance measure employed, as these details are essential for reproducibility and evaluation of the correlation.
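
Since the abstract does not say which geographic distance is used, the following sketch shows one common choice, the great-circle (haversine) distance between two coordinate pairs. Treating each language as a single point and using a spherical Earth of radius 6371 km are assumptions made here for illustration, not details from the paper.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two points given in degrees.

    Assumes a spherical Earth; `radius_km` is the mean Earth radius.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Quarter of the equator: from (0°, 0°) to (0°, 90°E).
quarter_equator = haversine_km(0.0, 0.0, 0.0, 90.0)
```

Other choices, such as travel distance along plausible migration routes, could give a different correlation, which is why the referee asks for the measure to be stated.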

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each major comment below and indicate the revisions we will make.


Referee: [Abstract] The central assertion that 'phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system' is load-bearing for all downstream claims yet is presented without comparison to higher-order Markov models, syllable-level constraints, or long-range dependencies; if these omitted structures dominate family signals, the reported geographic correlation cannot be taken as evidence of deep relatedness.

Authors: We selected second-order Markov chains to model immediate phoneme dependencies while remaining computationally tractable with the available parallel corpus data. The empirical success of this model in recovering major language families and contact signatures provides indirect support for its adequacy. Nevertheless, we agree that explicit comparisons would strengthen the central claim. In the revised manuscript we will add a supplementary analysis comparing first-, second-, and third-order models on a representative subset of languages, showing that the second-order distance matrix yields the clearest family structure without the sparsity problems of higher orders.

Revision: yes

Referee: [Abstract, Results] The claim of a 'clear correlation with geographic distance' used to constrain the Indo-European homeland lacks any reported controls for geographic sampling bias, recent contact effects, or alternative distance metrics; without these, the Steppe-homeland inference rests on an unvalidated correlation whose robustness cannot be assessed.

Authors: The reported geographic correlation is an observational result derived from the phonological distances. We acknowledge that the manuscript would benefit from explicit robustness checks. In revision we will add (i) partial Mantel tests controlling for language-family membership to mitigate contact and genetic effects, (ii) a discussion of geographic sampling balance across the 67 languages, and (iii) a comparison of our feature-augmented distance against a simple phoneme-edit-distance baseline. These additions will allow readers to evaluate the strength of the Steppe-homeland inference more rigorously.

Revision: yes
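
For readers unfamiliar with the test named in the rebuttal, here is a minimal sketch of a simple Mantel permutation test between two distance matrices. It is the plain version, not the partial Mantel test the simulated authors promise, and the toy matrices, function names, and permutation count are all hypothetical.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix into a list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    n = len(dist_a)
    flat_a = upper_triangle(dist_a)
    observed = pearson(flat_a, upper_triangle(dist_b))
    hits = 0
    for _ in range(permutations):
        # Relabel the items of the second matrix (rows and columns together).
        perm = list(range(n))
        rng.shuffle(perm)
        permuted = [[dist_b[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if pearson(flat_a, upper_triangle(permuted)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data (hypothetical): six "languages" placed on a line, with a
# "phonological" distance that grows as the square root of separation.
positions = [0, 1, 2, 3, 4, 5]
geo = [[abs(a - b) for b in positions] for a in positions]
phon = [[math.sqrt(abs(a - b)) for b in positions] for a in positions]
r, p = mantel(phon, geo)
```

Permuting rows and columns together, rather than shuffling individual entries, respects the dependence among distances that share a language. A partial Mantel test would additionally regress out a third matrix, such as shared family membership, before correlating the residuals.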

Circularity Check

0 steps flagged

No significant circularity; distances computed directly from corpus data


The derivation computes phonological distances from a parallel corpus of 67 languages by modeling phoneme sequences as second-order Markov chains and incorporating articulatory features. These distances are then observed to recover families, show contact effects, and correlate with external geographic data to constrain the Indo-European homeland. No equation or step reduces by construction to a fitted parameter, self-citation chain, or input that already encodes the target result. The Markov modeling choice and distance metric are applied uniformly to raw corpus data without post-hoc tuning to geography or known families, making the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that second-order Markov chains suffice to capture phonological statistics for relatedness inference; no free parameters are explicitly fitted in the abstract, and no new entities are introduced.

free parameters (1)

Markov chain order
Fixed at second-order in the abstract; this choice is a modeling decision that could be tuned and affects captured dependencies.

axioms (2)

Domain assumption: Phoneme sequences can be modeled as second-order Markov chains that essentially capture the statistical correlations of a phonological system.
Directly stated in the abstract as the key enabling finding for the distance metric.
Domain assumption: The resulting distance matrix incorporating articulatory features reflects true linguistic relatedness and geographic patterns.
Invoked to interpret the recovery of families and the geographic correlation used for the homeland constraint.



[1] [1]

Goebl, Literary and linguistic computing21, 411 (2006)

H. Goebl, Literary and linguistic computing21, 411 (2006)

work page 2006

[2] [2]

Nerbonne and W

J. Nerbonne and W. Heeringa, inComputational phonology: Third meeting of the ACL special interest group in computational phonology(1997)

work page 1997

[3] [3]

Nerbonne and W

J. Nerbonne and W. Heeringa, Dialectologia et Geolinguistica9, 69–83 (2001)

work page 2001

[4] [4]

B. R. Chiswick and P. W. Miller, Journal of multilingual and multicultural development26, 1 (2005)

work page 2005

[5] [5]

Esser,Migration, language and integration(WZB Berlin, 2006)

H. Esser,Migration, language and integration(WZB Berlin, 2006)

work page 2006

[6] [6]

Levshina, Linguistic Typology26, 129 (2022)

N. Levshina, Linguistic Typology26, 129 (2022)

work page 2022

[7] [7]

G. B. Jenset and B. McGillivray,Quantitative historical linguistics: A corpus framework, Vol. 26 (Oxford University Press, 2017)

work page 2017

[8] [8]

Gamallo, J

P. Gamallo, J. R. Pichel, and I. Alegria, Physica A: Statistical Mechanics and its Applications 484, 152 (2017)

work page 2017

[9] [9]

C. H. Brown, E. W. Holman, S. Wichmann, and V. Velupillai, Language Typology and Universals61, 285 (2008)

work page 2008

[10] [10]

Wichmann, E

S. Wichmann, E. W. Holman, D. Bakker, and C. H. Brown, Physica A: Statistical Mechanics and its Applications389, 3632 (2010)

work page 2010

[11] [11]

Serva and F

M. Serva and F. Petroni, Europhysics Letters81, 68005 (2008)

work page 2008

[12] [12]

Petroni and M

F. Petroni and M. Serva, Journal of Statistical Mechanics: Theory and Experiment2008, P08012 (2008)

work page 2008

[13] [13]

Gamallo, J

P. Gamallo, J. R. Pichel, and I. Alegria, Information11, 181 (2020)

work page 2020

[14] [14]

Estarrona, I

A. Estarrona, I. Etxeberria, M. Padilla-Moyano, and A. Soraluze, Procesamiento del Lenguaje Natural70, 53 (2023)

work page 2023

[15] [15]

Marian, J

V. Marian, J. Bartolotti, S. Chabal, and A. Shook, PLOS ONE7, e43230 (2012)

work page 2012

[16] [16]

S. E. Eden,Measuring phonological distance between languages, Phd thesis, UCL (University College London) (2018)

work page 2018

[17] [17]

Lara-Mart´ ınez, B

P. Lara-Mart´ ınez, B. Obreg´ on-Quintana, C. Reyes-Manzano, I. L´ opez-Rodr´ ıguez, and L. Guzm´ an-Vargas, PLOS ONE17, e0274617 (2022)

work page 2022

[18] [18]

De Gregorio, R

J. De Gregorio, R. Toral, and D. S´ anchez, EPJ Data Science13, 61 (2024). 23

work page 2024

[19] [19]

De Marneffe, C

M.-C. De Marneffe, C. D. Manning, J. Nivre, and D. Zeman, Computational linguistics47, 255 (2021)

work page 2021

[20] [20]

Li, Journal of Quantitative Linguistics , 1 (2025)

W. Li, Journal of Quantitative Linguistics , 1 (2025)

work page 2025

[21] [21]

Jeszenszky, P

P. Jeszenszky, P. Stoeckle, E. Glaser, and R. Weibel, Journal of Linguistic Geography5, 86 (2017)

work page 2017

[22] [22]

J¨ ager, Scientific data5, 1 (2018)

G. J¨ ager, Scientific data5, 1 (2018)

work page 2018

[23] [23]

Wichmann, A

S. Wichmann, A. M¨ uller, and V. Velupillai, Diachronica27, 247 (2010)

work page 2010

[24] [24]

Bouckaert, P

R. Bouckaert, P. Lemey, M. Dunn, S. J. Greenhill, A. V. Alekseyenko, A. J. Drummond, R. D. Gray, M. A. Suchard, and Q. D. Atkinson, Science337, 957 (2012)

work page 2012

[25] [25]

Chang, C

W. Chang, C. Cathcart, D. Hall, and A. Garrett, Language91, 194 (2015)

work page 2015

[26] [26]

Heggarty, C

P. Heggarty, C. Anderson, M. Scarborough, B. King, R. Bouckaert, L. Jocz, M. J. K¨ ummel, T. J¨ ugel, B. Irslinger, R. Pooth,et al., Science381, eabg0818 (2023)

work page 2023

[27] [27]

S´ anchez, L

D. S´ anchez, L. Zunino, J. De Gregorio, R. Toral, and C. Mirasso, Chaos: An Interdisciplinary Journal of Nonlinear Science33, 033121 (2023)

work page 2023

[28] C. Christodoulopoulos and M. Steedman, Language Resources and Evaluation 49, 375 (2015)

[29] YouVersion, “Bible texts,” https://www.bible.com (2026), downloaded versions of biblical texts in multiple languages

[30] M. Bernard and H. Titeux, Journal of Open Source Software 6, 3958 (2021)

[31] D. R. Mortensen, S. Dalmia, and P. Littell, in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), edited by N. Calzolari (Conference chair), K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, and T. Tokunaga (European Language Resources Ass...

[32] M. Mavridis, “Data and code for ‘Phonological distances for linguistic typology and the origin of Indo-European languages’,” https://github.com/MariusMavridis/Phonetic-Distances/ (2026)

[33] J. L. Lee, L. F. Ashby, M. E. Garza, Y. Lee-Sikka, S. Miller, A. Wong, A. D. McCarthy, and K. Gorman, in Proceedings of the Twelfth Language Resources and Evaluation Conference, edited by N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, and S. Piperidis (Eu...

[34] Wiktionary contributors, “Wiktionary, the free dictionary,” https://www.wiktionary.org/ (2026), online collaborative dictionary

[35] R. M. Dixon and A. Y. Aikhenvald, Word: A cross-linguistic typology (Cambridge University Press, 2003)

[36] A. E. Raftery, Journal of the Royal Statistical Society Series B: Statistical Methodology 47, 528 (1985)

[37] J. P. Crutchfield and D. P. Feldman, Chaos: An Interdisciplinary Journal of Nonlinear Science 13, 25 (2003)

[38] J. De Gregorio, D. Sánchez, and R. Toral, Chaos 36, 033124 (2026)

[39] J. De Gregorio, D. Sánchez, and R. Toral, Chaos, Solitons & Fractals 165, 112797 (2022)

[40] I. Nemenman, F. Shafee, and W. Bialek, Advances in Neural Information Processing Systems 14, 471 (2001)

[41] J. De Gregorio, D. Sánchez, and R. Toral, Entropy 26, 79 (2024)

[42] M. A. Kohler, W. D. Andrews, J. P. Campbell, and J. Hernández-Cordero, in Conference Record of Thirty-Fifth Asilomar Conference on Signals, Systems and Computers (Cat. No. 01CH37256), Vol. 2 (IEEE, 2001) pp. 1557–1561

[43] J. De Gregorio, R. Toral, and D. Sánchez, EPJ Data Science 13, 61 (2024)

[44] D. R. Mortensen, P. Littell, A. Bharadwaj, K. Goyal, C. Dyer, and L. S. Levin, in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (ACL, 2016) pp. 3475–3484

[45] L. V. Kantorovich, Management Science 6, 366 (1960)

[46] V. M. Panaretos and Y. Zemel, Annual Review of Statistics and Its Application 6, 405 (2019)

[47] Y. Rubner, C. Tomasi, and L. J. Guibas, in Sixth international conference on computer vision (IEEE Cat. No. 98CH36271) (IEEE, 1998) pp. 59–66

[48] E. Levina and P. Bickel, in Proceedings eighth IEEE international conference on computer vision. ICCV 2001, Vol. 2 (IEEE, 2001) pp. 251–256

[49] B. M. Bolstad, R. A. Irizarry, M. Åstrand, and T. P. Speed, Bioinformatics 19, 185 (2003)

[50] S. N. Evans and F. A. Matsen, Journal of the Royal Statistical Society Series B: Statistical Methodology 74, 569 (2012)

[51] D. Alvarez-Melis and T. Jaakkola, in Proceedings of the 2018 conference on empirical methods in natural language processing (2018) pp. 1881–1890

[52] T. Louf, D. Sánchez, and J. J. Ramasco, Physical Review Research 3, 043146 (2021)

[53] M. Cuturi, Advances in Neural Information Processing Systems 26 (2013)

[54] R. Flamary, N. Courty, A. Gramfort, M. Z. Alaya, A. Boisbunon, S. Chambon, L. Chapel, A. Corenflos, K. Fatras, N. Fournier, L. Gautheron, N. T. H. Gayraud, H. Janati, A. Rakotomamonjy, I. Redko, A. Rolet, A. Schutz, V. Seguy, D. J. Sutherland, R. Tavenard, A. Tong, and T. Vayer, Journal of Machine Learning Research 22, 1 (2021)

[55] J. H. Ward Jr., Journal of the American Statistical Association 58, 236 (1963)

[56] M. S. Dryer and M. Haspelmath, eds., WALS Online (v2020.4) (Zenodo, 2013)

[57] T. Pronk, “Balto-Slavic,” in The Indo-European Language Family, edited by T. Olander (Cambridge University Press, 2022) p. 269

[58] J. I. Hualde, Basque phonology (Routledge, 2004)

[59] K. Leppik and P. Lippus, in XXVIII Fonetiikan päivät. Turku 25.-26. lokakuuta 2013. Konferenssijulkaisu. Turku: Turun yliopisto (2014) pp. 19–26

[60] I. Feldhausen, Sentential form and prosodic structure of Catalan (John Benjamins Publishing Company, 2010)

[61] J. P. Mallory and D. Q. Adams, in Encyclopedia of Indo-European Culture (Fitzroy Dearborn, London, 1997) pp. 8–11

[62] B. Tikkanen, in Archaeology and Language IV (Routledge, 2003) pp. 139–148

[63] Incidentally, we can employ a phonological distance calculation to quantitatively verify that our corpus is representative. Thus, we compute the distance between the English probability distributions of the Bible and of Herman Melville’s Moby Dick, a frequently analyzed text in computational and quantitative linguistics [W. Ebeling and T. Pöschel, Europhysics Le...

[64] G. J. Székely, M. L. Rizzo, and N. K. Bakirov, The Annals of Statistics 35, 2769 (2007)

[65] Q. D. Atkinson, Science 332, 346 (2011)

[66] J. Fort and J. Pérez-Losada, Journal of The Royal Society Interface 13, 20160185 (2016)

[67] T. F. Jaeger, P. Graff, W. Croft, and D. Pontillo, Linguistic Typology 15, 281 (2011)

[68] K. Hunley, C. Bowern, and M. Healy, Proceedings of the Royal Society B: Biological Sciences 279, 2281 (2012)

[69] N. Balakrishnan and V. B. Nevzorov, A primer on statistical distributions (John Wiley & Sons, 2004) Chap. 27

[70] M. Gimbutas, Journal of Indo-European Studies 1, 1 (1973)

[71] J. P. Mallory, In search of the Indo-Europeans: Language, archaeology and myth (Thames and Hudson, 1989)

[72] G. Kroonen, A. Jakob, A. I. Palmér, P. van Sluis, and A. Wigman, PLOS ONE 17, e0275744 (2022)

[73] I. Lazaridis, N. Patterson, D. Anthony, L. Vyazov, R. Fournier, H. Ringbauer, I. Olalde, A. A. Khokhlov, E. P. Kitov, N. I. Shishlina, et al., Nature 639, 132 (2025)

[74] C. Renfrew, Archaeology and language: the puzzle of Indo-European origins (CUP Archive, 1990)

[75] W. Labov, The Social Stratification of English in New York City (Cambridge University Press, Cambridge, UK, 1966)

[76] M. Haspelmath, in Language typology and language universals (Handbücher zur Sprach- und Kommunikationswissenschaft) (de Gruyter, 2001) pp. 1492–1510

[77] C. P. Masica, Defining a linguistic area: South Asia (Orient Blackswan, 2005)

[78] M. Cysouw, in Space in language and linguistics: Geographical, interactional, and cognitive perspectives (de Gruyter, 2013)

[79] S. J. Greenhill, P. Heggarty, and R. D. Gray, The Handbook of Historical Linguistics 2, 226 (2020)

[80] R. G. Gordon Jr., ed., Ethnologue: Languages of the World, 15th ed. (SIL International, Dallas, TX, 2005), https://www.ethnologue.com