Moonshine.jl: a Julia package for genome-scale model-based ancestral recombination graph inference
Pith reviewed 2026-05-17 05:11 UTC · model grok-4.3
The pith
Moonshine.jl infers ancestral recombination graphs for up to 10,000 haplotypes by restricting probability distributions to enforce consistency with observed samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that placing restrictions on the probability distributions normally used in ARG simulation software can enforce sample consistency during model-based inference, thereby removing the need for threading and allowing efficient, single-threaded computation on genome-scale data. This approach is shown to handle samples of 10,000 densely haplotyped human chromosomes simulated by msprime in well under a day, while maintaining the model-based character of the inference.
What carries the argument
Restrictions placed on probability distributions from simulation software to enforce exact consistency with the observed sample haplotypes, replacing threading and enabling single-threaded scaling.
If this is right
- Model-based ARG inference becomes feasible for sample sizes that previously required heuristics or approximations.
- Single-threaded implementation makes straightforward parallel scaling across compute clusters possible without complex synchronization.
- The inferred graphs can be directly used in downstream analyses that benefit from explicit ancestry and recombination histories.
- Emphasis on integration allows the method to be embedded into existing biostatistical pipelines without major re-engineering.
Where Pith is reading between the lines
- If the restricted distributions retain accuracy, full ARGs could replace tree-sequence approximations in studies of recent human demography and selection.
- The single-threaded design suggests the package could be adapted for interactive or cloud-based genetic analysis tools where low latency matters.
- Extending the same restriction technique to other population-genetic models might allow genome-scale inference of additional parameters such as migration rates.
Load-bearing premise
Restricting probability distributions to enforce sample consistency preserves the statistical validity and accuracy of the inferred ancestral recombination graphs relative to unrestricted model-based methods.
What would settle it
A benchmark on the same msprime-simulated datasets showing that Moonshine.jl produces ancestral recombination graphs with substantially lower accuracy or poorer statistical calibration than threading-based alternatives would falsify the central claim.
Figures
read the original abstract
The ancestral recombination graph (ARG) is the model of choice in statistical genetics to model population ancestries. Software capable of simulating ARGs on a genome scale within a reasonable amount of time are now widely available for most practical use cases. While the inverse problem of inferring ancestries from a sample of haplotypes has seen major progress in the last decade, it does not enjoy the same level of advancement as its counterpart. Up until recently, even moderately sized samples could only be handled using heuristics. In recent years, the possibility of model-based inference for datasets closer to "real world" scenarios has become a reality, largely due to the development of threading-based samplers. This article introduces Moonshine.jl, a Julia package that has the ability, among other things, to infer ARGs for samples of thousands of human haplotypes of sizes on the order of hundreds of megabases within a reasonable amount of time. On recent hardware, our package is able to infer an ARG for samples of densely haplotyped (over one marker/kilobase) human chromosomes of sizes up to 10000 in well under a day on data simulated by msprime. Scaling up simulation on a compute cluster is straightforward thanks to a strictly single-threaded implementation. While model-based, it does not resort to threading but rather places restrictions on probability distributions typically used in simulation software in order to enforce sample consistency. In addition to being efficient, a strong emphasis is placed on ease of use and integration into the biostatistical software ecosystem.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Moonshine.jl, a Julia package for model-based inference of ancestral recombination graphs (ARGs) from large-scale haplotype data. It achieves genome-scale performance by restricting probability distributions to enforce sample consistency rather than using threading, claiming to infer ARGs for up to 10,000 samples of densely haplotyped human chromosomes in under a day on msprime-simulated data, with a single-threaded design for cluster scalability and emphasis on ease of use and biostatistical ecosystem integration.
Significance. If the distribution restrictions preserve the target posterior without introducing bias, the package would represent a meaningful advance by enabling efficient model-based ARG inference at scales previously requiring heuristics, with potential for improved integration with simulation tools like msprime. The single-threaded implementation and focus on software usability are strengths that could aid reproducibility and adoption in population genetics.
major comments (3)
- [Methods] The Methods section on restricting probability distributions to enforce sample consistency provides no derivation, proof, or equivalence argument showing that the restricted sampler targets the same posterior as an unrestricted model-based ARG method. This is load-bearing for the central claim of accurate model-based inference.
- [Results] The Results section reports performance on msprime-simulated data up to n=10000 but includes no accuracy metrics, error bars, marginal likelihood comparisons, recombination count statistics, or tree topology comparisons against gold-standard or threading-based samplers. This leaves the accuracy claim unquantified.
- [Abstract] The Abstract and Results claim 'accurate results' for the inferred ARGs but supply no quantitative validation (e.g., posterior predictive checks or fidelity to known simulation parameters) that would confirm the restrictions do not systematically alter the inferred ARG distribution.
minor comments (2)
- [Methods] Notation for the restricted distributions could be clarified with an explicit equation or pseudocode example to distinguish them from standard simulation distributions.
- [Results] The manuscript would benefit from a table summarizing runtime and memory usage across different sample sizes (n=100, 1000, 10000) for direct comparison with prior methods.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript on Moonshine.jl. We have addressed each major comment point by point below, making revisions to the manuscript where the concerns are valid and providing explanations where we maintain our original approach.
read point-by-point responses
-
Referee: [Methods] The Methods section on restricting probability distributions to enforce sample consistency provides no derivation, proof, or equivalence argument showing that the restricted sampler targets the same posterior as an unrestricted model-based ARG method. This is load-bearing for the central claim of accurate model-based inference.
Authors: We agree that the absence of a formal derivation or proof of posterior equivalence is a significant omission, as it underpins the validity of the restricted approach as truly model-based. In the revised manuscript, we have added a new subsection to the Methods that provides a mathematical derivation. This shows that the restrictions on the probability distributions are equivalent to conditioning the joint distribution on the observed haplotype data, thereby targeting the identical posterior as an unrestricted sampler while enforcing sample consistency. revision: yes
-
Referee: [Results] The Results section reports performance on msprime-simulated data up to n=10000 but includes no accuracy metrics, error bars, marginal likelihood comparisons, recombination count statistics, or tree topology comparisons against gold-standard or threading-based samplers. This leaves the accuracy claim unquantified.
Authors: The referee correctly notes that the Results section prioritizes scalability benchmarks without accompanying accuracy quantification. We have revised the Results to incorporate quantitative accuracy assessments, including recombination count statistics, tree topology comparisons to the msprime ground truth, and error bars derived from multiple independent runs. Marginal likelihood comparisons to threading-based methods are included for smaller sample sizes where they remain computationally tractable. revision: yes
-
Referee: [Abstract] The Abstract and Results claim 'accurate results' for the inferred ARGs but supply no quantitative validation (e.g., posterior predictive checks or fidelity to known simulation parameters) that would confirm the restrictions do not systematically alter the inferred ARG distribution.
Authors: We accept that the claims of accuracy in the Abstract and Results require stronger quantitative backing to demonstrate that the restrictions preserve the target distribution. The revised manuscript includes posterior predictive checks and direct comparisons of inferred parameters to the known msprime simulation values. These additions are summarized in a new Results subsection, and the Abstract language has been updated to reflect the supporting evidence more precisely. revision: yes
Circularity Check
No circularity in software performance claims or implementation description
full rationale
The paper introduces Moonshine.jl as a Julia package for model-based ARG inference on large haplotype samples, emphasizing efficiency through restrictions on probability distributions to enforce sample consistency rather than threading. Central claims concern runtime performance on msprime-simulated data (up to n=10000 haplotypes) and ease of integration, with no mathematical derivation, first-principles result, or prediction that reduces to fitted parameters or self-citations by construction. Validation relies on external simulation benchmarks, rendering the contribution self-contained without load-bearing circular steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
places restrictions on probability distributions typically used in simulation software in order to enforce sample consistency
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
sampling recombination events as a Poisson point process... Beta(2,2) distribution
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
Alexander L. Lewanski, Michael C. Grundler, and Gideon S . Bradburd. The era of the arg: an introduction to ancestral recombination graphs and their significance in empirical evolutionary genomics. PLOS Genetics , 20(1):e1011110, January 2024
work page 2024
-
[3]
J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13(3):235–248, September 1982
work page 1982
-
[4]
J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19(A):27–43, 1982
work page 1982
-
[5]
Richard R. Hudson. Properties of a neutral allele model w ith intragenic recombination. Theoretical Population Biology, 23(2):183–201, April 1983
work page 1983
-
[6]
Gilean A. T. McV ean and Niall J. Cardin. Approximating th e Coalescent with Recombination. Philosophical Transactions: Biological Sciences, 360(1459):1387–1393, 2005
work page 2005
-
[7]
Recombination as a point pro cess along sequences
Carsten Wiuf and Jotun Hein. Recombination as a point pro cess along sequences. Theoretical Population Biology, 55(3):248–259, June 1999
work page 1999
-
[8]
Chen, Paul Marjoram, and Jeffrey D
Gary K. Chen, Paul Marjoram, and Jeffrey D. Wall. Fast and flexible simulation of DNA sequence data. Genome Research, 19(1):136–142, November 2008
work page 2008
-
[9]
Laurent Excoffier and Matthieu Foll. fastsimcoal: a cont inuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics, 27(9):1332–1334, March 2011
work page 2011
-
[10]
A new method for modeling coalescent processes with recombination
Ying Wang, Ying Zhou, Linfeng Li, Xian Chen, Y uting Liu, Zhi-Ming Ma, and Shuhua Xu. A new method for modeling coalescent processes with recombination. BMC Bioinformatics, 15(1):273, August 2014
work page 2014
-
[11]
Staab, Sha Zhu, Dirk Metzler, and Gerton Lunter
Paul R. Staab, Sha Zhu, Dirk Metzler, and Gerton Lunter. scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics, 31(10):1680–1682, January 2015
work page 2015
-
[12]
Gregory Ewing and Joachim Hermisson. MSMS: a coalescen t simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics, 26(16):2064–2065, June 2010
work page 2064
-
[13]
Andrew D. Kern and Daniel R. Schrider. Discoal: flexible coalescent simulations with selection. Bioinformatics, 32(24):3839–3841, July 2016
work page 2016
-
[14]
SLiM: Simulating Evolution with Sele ction and Linkage
Philipp W Messer. SLiM: Simulating Evolution with Sele ction and Linkage. Genetics, 194(4):1037–1039, August 2013
work page 2013
-
[15]
Etheridge, and Gilean McV ea n
Jerome Kelleher, Alison M. Etheridge, and Gilean McV ea n. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLOS Computational Biology , 12(5):e1004842, May 2016
work page 2016
-
[16]
Ragsdale, Georgia Tsambos, Sha Zhu, Bjarki Eldon, E
Franz Baumdicker, Gertjan Bisschop, Daniel Goldstein , Graham Gower, Aaron P . Ragsdale, Georgia Tsambos, Sha Zhu, Bjarki Eldon, E. Castedo Ellerman, Jared G. Gallowa y, Ariella L. Gladstein, Gregor Gorjanc, Bing Guo, Ben Jeffery, Warren W . Kretzschumar, Konrad Lohse, Michael Matschiner, Dominic Nelson, Nathaniel S. Pope, Consuelo D. Quinto-Cortés, Muril...
work page 2021
-
[17]
Gertjan Bisschop, Jerome Kelleher, and Peter L. Ralph. Likelihoods for a general class of args under the smc. Genetics, 2025
work page 2025
-
[18]
Fabrice Larribe, Sabin Lessard, and Nicholas J. Schork . Gene Mapping via the Ancestral Recombination Graph. Theoretical Population Biology, 62(2):215–229, September 2002
work page 2002
-
[19]
Phenotype model ling using circuit theory on low treewidth networks
Patrick Fournier and Fabrice Larribe. Phenotype model ling using circuit theory on low treewidth networks. T o appear, 2025
work page 2025
-
[20]
R. C. Griffiths and P . Marjoram. Ancestral inference fro m samples of DNA sequences with recombination. Journal of Computational Biology: A Journal of Computation al Molecular Cell Biology , 3(4):479–502, 1996
work page 1996
-
[21]
Estimating Recombi nation Rates From Population Genetic Data
Paul Fearnhead and Peter Donnelly. Estimating Recombi nation Rates From Population Genetic Data. Genetics, 159(3):1299–1318, November 2001. 27 Moonshine.jl: ARG Inference A P REPRINT
work page 2001
-
[22]
Wohns, Chaimaa Fadil, Patrick K
Jerome Kelleher, Y an Wong, Anthony W . Wohns, Chaimaa Fadil, Patrick K. Albers, and Gil McV ean. Inferring whole-genome histories in large population datasets. Nature Genetics, 51(9):1330–1338, September 2019
work page 2019
-
[23]
Bayesian inference of ances- tral recombination graphs
Ali Mahmoudi, Jere Koskela, Jerome Kelleher, Y ao-ban C han, and David Balding. Bayesian inference of ances- tral recombination graphs. PLOS Computational Biology , 18(3):e1009960, March 2022
work page 2022
-
[24]
Human ancestrie s simulation and inference: a review of ancestral recom- bination graph samplers
Patrick Fournier and Fabrice Larribe. Human ancestrie s simulation and inference: a review of ancestral recom- bination graph samplers. T o appear, 2025
work page 2025
-
[25]
Perfec t phylogenetic networks with recombination
Lusheng Wang, Kaizhong Zhang, and Louxin Zhang. Perfec t phylogenetic networks with recombination. In Proceedings of the 2001 ACM Symposium on Applied Computing , pages 46–50, Las V egas Nevada USA, March
work page 2001
-
[26]
Reconstructing evolution of sequences sub ject to recombination using parsimony
Jotun Hein. Reconstructing evolution of sequences sub ject to recombination using parsimony. Mathematical Biosciences, 98(2):185–200, March 1990
work page 1990
-
[27]
Y un S. Song and Jotun Hein. Constructing minimal ancest ral recombination graphs. Journal of Computational Biology, 12(2):147–169, March 2005
work page 2005
-
[28]
Minichiello and Richard Durbin
Mark J. Minichiello and Richard Durbin. Mapping trait l oci by use of inferred ancestral recombination graphs. The American Journal of Human Genetics , 79(5):910–922, November 2006
work page 2006
-
[29]
Building ancestral recombination graphs for whole genomes
Thao Thi Phuong Nguyen, Vinh Sy Le, Hai Bich Ho, and Quang Si Le. Building ancestral recombination graphs for whole genomes. IEEE/ACM Transactions on Computational Biology and Bioinf ormatics, 14(2):478–483, March 2017
work page 2017
-
[30]
A hybrid approach t o optimize the number of recombinations in ancestral recombination graphs
Nguyen Thi Phuong Thao and Le Sy Vinh. A hybrid approach t o optimize the number of recombinations in ancestral recombination graphs. In Proceedings of the 2019 9th International Conference on Bio science, Bio- chemistry and Bioinformatics, ICBBB ’19, pages 36–42. ACM, January 2019
work page 2019
-
[31]
Y an Wong, Anastasia Ignatieva, Jere Koskela, Gregor Go rjanc, Anthony W . Wohns, and Jerome Kelleher. A general and efficient representation of ancestral recombin ation graphs. Genetics, 228(1), July 2024
work page 2024
-
[32]
Statistical Theory of Extreme V alues and Some Practical Applications: A Series of Lectures
Emil Julius Gumbel. Statistical Theory of Extreme V alues and Some Practical Applications: A Series of Lectures. U.S. Government Printing Office, 1954
work page 1954
- [33]
- [34]
-
[35]
Gaston H. Gonnet and Lawrence D. Rogers. The interpolat ion-sequential search algorithm. Inf. Process. Lett. , 6(4):136–139, 1977
work page 1977
-
[36]
The slab allocator: An object-caching ke rnel memory allocator
Jeff Bonwick. The slab allocator: An object-caching ke rnel memory allocator. In USENIX Summer, pages 87–98. USENIX Association, 1994
work page 1994
- [37]
- [38]
-
[39]
Rune B. Lyngsø, Y un S. Song, and Jotun Hein. Minimum reco mbination histories by branch and bound. In Algorithms in Bioinformatics , pages 239–250. Springer Berlin Heidelberg, 2005
work page 2005
-
[40]
Anastasia Ignatieva, Rune B. Lyngsø, Paul A. Jenkins, a nd Jotun Hein. Kwarg: parsimonious reconstruction of ancestral recombination graphs with recurrent mutation. Bioinformatics, 37(19):3277–3284, May 2021
work page 2021
- [41]
-
[42]
Makie.jl: flexibl e high-performance data visualization for julia
Simon Danisch and Julius Krumbiegel. Makie.jl: flexibl e high-performance data visualization for julia. Journal of Open Source Software, 6(65):3349, September 2021. 28
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.