WaST: a formalisation of the Wave model with associated statistical inference and applications

Gr\'egoire Clart\'e

arxiv: 2604.08220 · v2 · submitted 2026-04-09 · 📊 stat.AP

WaST: a formalisation of the Wave model with associated statistical inference and applications

Gr\'egoire Clart\'e This is my paper

Pith reviewed 2026-05-10 17:44 UTC · model grok-4.3

classification 📊 stat.AP

keywords wave modelBayesian inferencegraph diffusionhistorical linguisticsMetropolis-Hastingstrait evolutiondeath processpopulation contacts

0 comments

The pith

The wave model of trait spread among populations is formalized as a Bayesian process on a fixed contact graph with innovations decaying by death.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper formalizes the wave model from historical linguistics as a generative statistical process suited to situations of ongoing contact between populations, such as dialects or cultural groups, where traits arise and diffuse along connections rather than branching in isolation. It constructs a fully Bayesian model in which new innovations appear, spread to neighboring populations according to an undirected graph, and then disappear independently according to a death process. A Metropolis-Hastings within Gibbs sampler is developed to draw from the posterior distribution over possible graphs given observed trait data. The approach is tested on simulated data and real linguistic examples to recover contact structures. This matters for fields studying joint evolution under permanent interaction, where tree models that assume no further contact after splits are inadequate.

Core claim

The wave model can be expressed as a fully Bayesian generative model in which innovations spread along the edges of a fixed undirected graph that encodes population contacts and disappear according to an independent death process; the posterior distribution on the graph is sampled using a Metropolis-Hastings within Gibbs algorithm.

What carries the argument

A fixed undirected graph representing permanent contacts, paired with a generative process of innovation birth followed by graph diffusion and exponential death, with posterior sampling performed by Metropolis-Hastings within Gibbs.

If this is right

The inferred graph supplies a quantitative reconstruction of historical or social contacts consistent with observed trait distributions.
Model comparison between the graph-based wave process and tree-based alternatives becomes possible on the same datasets.
Posterior samples over graphs allow uncertainty quantification when predicting future trait spread or reconstructing past contacts.
The framework applies directly to any domain where traits or innovations propagate through stable contact networks, including cultural transmission and certain epidemiological settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Allowing the graph itself to evolve or to incorporate weighted or directed edges could address cases where contact intensity varies.
The same generative structure might be applied to archaeological or genetic data to test for wave-like diffusion versus branching divergence.
If the death process is replaced by a more flexible survival model, the framework could capture traits that persist indefinitely once adopted.

Load-bearing premise

The contact structure among populations can be adequately represented by a single fixed undirected graph whose edges are the only routes of innovation spread, with no further structure or time-varying contacts required.

What would settle it

Trait presence data from populations whose spread patterns require time-varying contacts or diffusion routes outside any single fixed graph, such that the recovered posterior places negligible mass on graphs that match the observed sharing patterns.

Figures

Figures reproduced from arXiv: 2604.08220 by Gr\'egoire Clart\'e.

**Figure 1.** Figure 1: Example of graph Pi . Because of the independence of the other parts of the process, once this subset of edges is selected, this latent variable is used in the numerical part. Once reaching the new node, it keeps propagating on the graph to all the connected nodes that have not yet received it. Each node that has the trait can lose it following a standard death process with rate ν (eq. 5). Thus, a trait at… view at source ↗

**Figure 2.** Figure 2: Example of spread of a trait. The circle represents the reachable nodes at the time of [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: True graphed of the simulated results: graphical representation and adjacency matrix. [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Posterior distribution on the graph for the simulated dataset. Two independent reali [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Plot of the likelihood along the chain of a simulation of WaST on the simlated dataset [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Results from Kalyan et al. [2019] on the North Vanuatu languages on a geographical [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Results from WaST on the North Vanuatu dataset. [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Results from WaST on the North Vanuatu dataset with edges with posterior probability [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Results on the Greek dialects dataset. Top posterior on the individual edges, bottom [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Results on the Kra-Dai looms dataset. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: WaST on the simulated example with the graph prior associated with [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

**Figure 12.** Figure 12: Results on the simulated dataset with a prior on the total length of the graph with [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗

**Figure 13.** Figure 13: Result on the misspecified simulated dataset [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗

**Figure 14.** Figure 14: Two independent replicas of WaST on the morphological sub dataset of the North [PITH_FULL_IMAGE:figures/full_fig_p027_14.png] view at source ↗

**Figure 15.** Figure 15: Two independent replicas of WaST on the Indo-European dataset [PITH_FULL_IMAGE:figures/full_fig_p028_15.png] view at source ↗

**Figure 16.** Figure 16: Sites included in the Ceramic dataset 29 [PITH_FULL_IMAGE:figures/full_fig_p029_16.png] view at source ↗

**Figure 17.** Figure 17: Results of WaST on the Ceramic dataset. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_17.png] view at source ↗

**Figure 18.** Figure 18: Results of WaST on the Ceramic dataset with less than 0 [PITH_FULL_IMAGE:figures/full_fig_p031_18.png] view at source ↗

read the original abstract

We propose a mathematical formalisation of the ``wave model'' originally developed in historical linguistics but with further applications in human sciences. This model assumes new traits appear in a population and spread to nearby populations depending on their closeness. It is mostly used to describe joint evolution of closely related populations, for example of several dialects. These situations of permanent contact are not accurately represented by its competitors based on tree structures. We built a fully Bayesian generative model where innovation spread along a fixed graph and disappear according to a death process. We then develop a Metropolis-Hastings within Gibbs sampler to sample from the posterior distribution on the graph. We test our method on simulated datasets as well as on several real dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sets up a Bayesian generative model and MH-within-Gibbs sampler for inferring fixed contact graphs under the wave model, but the fixed-graph assumption and simulation-only validation leave its real-data performance unclear.

read the letter

The paper formalizes the wave model as a generative process in which traits appear in one population, spread only along edges of a single fixed undirected graph, and then disappear according to an independent death process. It supplies a Metropolis-Hastings-within-Gibbs sampler that targets the posterior over graphs given observed trait patterns across populations. This is the concrete new piece: a usable statistical procedure for a setting where tree models are known to be a poor fit because of ongoing contact between groups such as dialects. The generative story is cleanly stated and the sampler is tailored to the graph posterior rather than a generic extension of earlier work. That combination is worth having on record for people who need to move beyond trees in cultural-evolution data. The tests on data simulated from the model itself and on a handful of real linguistic datasets are a reasonable start. The main limitation is the strong modeling assumption that one static undirected graph plus death rates is enough to explain the observed patterns. If contacts are time-varying, directed, or involve simultaneous borrowing from non-adjacent sources, the posterior on G will be a biased summary rather than a faithful reconstruction. The stress-test note is on target here: because the simulation studies stay inside the assumed generative process, they do not yet show how the method behaves under mild misspecification. Without seeing the actual convergence diagnostics, prior specifications, and quantitative recovery results on the real datasets, it is difficult to judge whether the inferred graphs are stable or linguistically plausible. This work is aimed at statisticians and linguists who already work with graph-based models of contact. A reader who needs an off-the-shelf alternative to trees for permanent-contact data would get something concrete to try. It deserves peer review; the core construction is coherent and the sampler is specified enough that referees can check the details and ask for the missing robustness checks.

Referee Report

3 major / 2 minor

Summary. The paper proposes WaST, a Bayesian generative model formalizing the wave model from historical linguistics. Innovations arise in populations and propagate only along edges of a fixed undirected graph representing contact structure, while vanishing according to an independent death process. A Metropolis-Hastings-within-Gibbs sampler is developed to target the posterior over graphs given observed trait patterns across populations. The method is evaluated on data simulated from the model itself and applied to several real linguistic datasets.

Significance. If the fixed-graph assumption holds and the sampler mixes adequately, the work supplies a fully Bayesian, generative alternative to tree-based models for settings with permanent contact among populations. The explicit construction of the likelihood via graph diffusion plus death process, together with the MH-within-Gibbs procedure, constitutes a concrete statistical contribution that could be reused in dialectology and related diffusion studies. The provision of an implementable sampler is a clear strength.

major comments (3)

[§4] §4 (Simulation study): All simulated datasets are generated from the identical fixed-undirected-graph plus death process that defines the likelihood. Consequently the reported recovery of the true G does not test whether the posterior concentrates on a meaningful contact graph when the data-generating process deviates from the model (time-varying edges, directed transmission, or simultaneous borrowing from non-adjacent sources). This is load-bearing for the claim that the inferred graph represents the true contact structure on real data.
[§3] §3 (Sampler): No prior specification is given for the graph (e.g., Erdős–Rényi, degree-corrected, or sparsity-inducing), nor for the death-rate parameters. Without these, the posterior p(G|data) is not fully defined and the reported sampler cannot be reproduced or diagnosed for convergence.
[§5] §5 (Real-data applications): No quantitative comparison is supplied against tree-based or other competing models (e.g., via posterior predictive checks or marginal likelihood). The advantage over tree structures asserted in the introduction therefore remains unquantified on the datasets actually analyzed.

minor comments (2)

Notation for the death process and the innovation arrival times is introduced without an explicit generative equation; adding a short algorithmic box or pseudocode would improve clarity.
The abstract states that the method is tested on simulated and real datasets but supplies no numerical summaries (recovery rates, posterior edge probabilities, effective sample sizes). These should be added to the abstract or a results table.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed report. We address each major comment below and describe the revisions we will implement to strengthen the manuscript.

read point-by-point responses

Referee: [§4] §4 (Simulation study): All simulated datasets are generated from the identical fixed-undirected-graph plus death process that defines the likelihood. Consequently the reported recovery of the true G does not test whether the posterior concentrates on a meaningful contact graph when the data-generating process deviates from the model (time-varying edges, directed transmission, or simultaneous borrowing from non-adjacent sources). This is load-bearing for the claim that the inferred graph represents the true contact structure on real data.

Authors: We agree that the simulation study validates sampler performance under correct specification but does not probe robustness to misspecification. In the revised manuscript we will add a dedicated paragraph in §4 that (i) explicitly states the simulation is an in-sample recovery check, (ii) discusses the wave-model assumptions required for the inferred graph to represent contact structure, and (iii) outlines how violations (e.g., time-varying edges) would affect posterior concentration. A small additional simulation under mild misspecification will also be included to illustrate sensitivity. These changes will qualify the claims on real-data interpretation without altering the core methodological contribution. revision: partial
Referee: [§3] §3 (Sampler): No prior specification is given for the graph (e.g., Erdős–Rényi, degree-corrected, or sparsity-inducing), nor for the death-rate parameters. Without these, the posterior p(G|data) is not fully defined and the reported sampler cannot be reproduced or diagnosed for convergence.

Authors: The referee correctly identifies an omission. We will revise §3 to state the priors explicitly: an Erdős–Rényi prior on the undirected graph with fixed edge probability p = 0.5, and independent Gamma(1,1) priors on the death-rate parameters. The full conditional distributions and Metropolis-Hastings proposal kernels will be written out, ensuring the posterior is completely defined and the sampler is reproducible. Convergence diagnostics will also be reported for the real-data runs. revision: yes
Referee: [§5] §5 (Real-data applications): No quantitative comparison is supplied against tree-based or other competing models (e.g., via posterior predictive checks or marginal likelihood). The advantage over tree structures asserted in the introduction therefore remains unquantified on the datasets actually analyzed.

Authors: We accept that a direct quantitative comparison would be desirable. Full implementation of competing tree-based models with equivalent likelihoods for the same datasets lies outside the scope of the present methodological paper. In revision we will augment §5 with posterior predictive checks that quantify how well the fitted graph model reproduces observed trait co-occurrence patterns that are incompatible with strict tree structures. These checks will provide a concrete, quantitative measure of model adequacy on the real linguistic datasets while preserving the paper’s focus on the new generative construction. revision: partial

Circularity Check

0 steps flagged

No circularity: generative model and sampler are defined independently of fitted outputs

full rationale

The paper defines a generative Bayesian model in which innovations appear, propagate along edges of a fixed undirected graph, and vanish via an independent death process; the Metropolis-Hastings-within-Gibbs sampler is then constructed to target the posterior p(G | data) under that explicit likelihood. No equation or step equates a derived quantity to a fitted parameter by construction, nor does any prediction reduce to a renaming of the input data or to a self-citation chain. Validation on data simulated from the same process is the standard check that the sampler recovers the known generating graph and does not create circularity. The central formalization therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is based solely on the abstract; therefore the ledger is necessarily incomplete. The central modeling choice is treated as a domain assumption rather than derived.

axioms (1)

domain assumption Innovations spread along a fixed graph and disappear according to a death process.
This is the core modeling assumption stated in the abstract as the basis for the generative model.

pith-pipeline@v0.9.0 · 5410 in / 1386 out tokens · 100288 ms · 2026-05-10T17:44:36.463779+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

Understanding uncertainty in bayesian cluster analysis.arXiv preprint arXiv:2506.16295,

Cecilia Balocchi and Sara Wade. Understanding uncertainty in bayesian cluster analysis.arXiv preprint arXiv:2506.16295,

work page arXiv
[2]

Hawe, Fabian J

Johann S. Hawe, Fabian J. Theis, and Matthias Heinig. Inferring interaction networks from multi-omics data.Frontiers in Genetics, Volume 10 - 2019,

work page 2019
[3]

Overall, some feature appear clearly: Celtic languages are grouped together, Scandinavian languages as well, but the overall position of the languages appear unclear. This can be explained by the choice of the languages: French in our dataset is the only romance language, grouping it with any of the other languages seem uncertain, but it has to be grouped...

work page 2023

[1] [1]

Understanding uncertainty in bayesian cluster analysis.arXiv preprint arXiv:2506.16295,

Cecilia Balocchi and Sara Wade. Understanding uncertainty in bayesian cluster analysis.arXiv preprint arXiv:2506.16295,

work page arXiv

[2] [2]

Hawe, Fabian J

Johann S. Hawe, Fabian J. Theis, and Matthias Heinig. Inferring interaction networks from multi-omics data.Frontiers in Genetics, Volume 10 - 2019,

work page 2019

[3] [3]

Overall, some feature appear clearly: Celtic languages are grouped together, Scandinavian languages as well, but the overall position of the languages appear unclear. This can be explained by the choice of the languages: French in our dataset is the only romance language, grouping it with any of the other languages seem uncertain, but it has to be grouped...

work page 2023