A mathematical framework for centromere-aware evaluation of human genome assemblies
Pith reviewed 2026-06-27 10:36 UTC · model grok-4.3
The pith
Centromere quality in genome assemblies can be scored by comparing distributions of inter-motif distances with KL divergence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Centromere assembly evaluation reduces to a comparative distribution problem in centeny representation: genomic distances between functional motifs are collected, their distributions are compared by KL divergence, and the scores rank both whole genomes and individual chromosomes among available human T2T assemblies.
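For reference, the standard discrete KL divergence between two binned distance distributions is written out here. The symbols P (query), Q (target), and the bin index i are notation assumed for this note; the summary does not state the paper's binning, smoothing, or which assembly plays which role.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \, \log \frac{P(i)}{Q(i)}
```

The quantity is non-negative, equals zero only when P and Q coincide, and is not symmetric in its arguments. It is infinite whenever Q(i) = 0 for a bin where P(i) > 0, so any practical implementation needs some form of smoothing or shared support.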
What carries the argument
Centeny representation, which converts centromeres into distributions of inter-motif genomic distances, with KL divergence used to measure agreement between query and target distributions.
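A minimal sketch of the idea, assuming motif positions are already available as integer coordinates. The function names, the fixed-width binning, the pseudocount smoothing, and the query-against-target direction of the divergence are all illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def inter_motif_distances(positions):
    """Distances (bp) between consecutive motif coordinates."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def binned_distribution(distances, bin_width, n_bins, pseudocount=1e-6):
    """Histogram distances into fixed-width bins and normalise to sum to 1.

    The last bin collects every distance beyond the covered range.
    The pseudocount (an assumption) keeps all bins strictly positive.
    """
    counts = Counter(min(d // bin_width, n_bins - 1) for d in distances)
    raw = [counts.get(i, 0) + pseudocount for i in range(n_bins)]
    total = sum(raw)
    return [x / total for x in raw]

def kl_divergence(p, q):
    """Discrete KL divergence D(p || q) for strictly positive p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical toy coordinates: the query has one enlarged gap.
target_positions = [0, 171, 342, 513, 684, 855]
query_positions = [0, 171, 342, 855, 1026, 1197]

p = binned_distribution(inter_motif_distances(query_positions), 100, 10)
q = binned_distribution(inter_motif_distances(target_positions), 100, 10)
score = kl_divergence(p, q)  # positive, because the spacing profiles differ
```

Both distributions are built on the same bins, which is what makes the comparison well defined; the choice of bin width and pseudocount will affect the absolute score.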
If this is right
- Available T2T human genomes receive an overall accuracy ranking based on centromere fidelity.
- Each chromosome within those genomes receives its own centromere-specific accuracy score.
- Repetitive centromeric regions become benchmarkable without reliance on sequence alignment.
- A quantitative, rapid scoring system for assembly integrity in repetitive DNA is established.
- Chromosome-level comparisons between genomes gain a distribution-based standard.
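The two levels of ranking listed above can be sketched as follows, assuming per-chromosome divergence scores have already been computed against a chosen target. Aggregating by the arithmetic mean is an assumption made for illustration; the summary does not say how the paper combines chromosome scores. Assembly names and values are hypothetical.

```python
from statistics import mean

def rank_chromosomes(per_chrom):
    """Sort one assembly's chromosomes from lowest to highest divergence."""
    return sorted(per_chrom.items(), key=lambda item: item[1])

def rank_assemblies(scores):
    """Sort assemblies by mean per-chromosome divergence (lower is closer to target)."""
    summary = {name: mean(per_chrom.values()) for name, per_chrom in scores.items()}
    return sorted(summary.items(), key=lambda item: item[1])

# Hypothetical scores, not results from the paper.
toy_scores = {
    "assembly_A": {"chr1": 0.10, "chr2": 0.40},
    "assembly_B": {"chr1": 0.05, "chr2": 0.15},
}

assembly_ranking = rank_assemblies(toy_scores)
chromosome_ranking = rank_chromosomes(toy_scores["assembly_A"])
```

A lower score here means closer agreement with the target's spacing profile, so the ranking is relative to whichever assembly is chosen as target.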
Where Pith is reading between the lines
- The same distance-distribution approach could be applied to other large repetitive elements such as telomeres.
- The metric might be validated by checking whether its rankings predict centromere activity in cellular assays.
- If motif spacing proves more diagnostic than sequence content, assembly algorithms could be redesigned to prioritize spacing fidelity.
- The framework offers a route to compare centromeres across species once functional motifs are identified.
Load-bearing premise
Genomic distances between functional motifs in the centeny representation are sufficient to assess functional agreement between query and target centromeres without needing sequence-level information.
What would settle it
Experimental evidence that two centromeres with substantially different inter-motif distance distributions nevertheless produce equivalent kinetochore formation and chromosome segregation would falsify the metric.
Original abstract
Accurate evaluation of genome assemblies within highly repetitive regions, such as centromeres, remains a major open challenge in genomics. Conventional benchmarking relies on sequence alignment, which becomes problematic in regions of high homogeneity and divergence. Here, we frame centromere assembly evaluation as a comparative distribution problem in a compact centeny representation by computing genomic distances between functional motifs, rather than relying on nucleotide sequence. Our distribution-based metric assesses agreement between a query and a target chromosome by comparing the distributions of their centromeric inter-motif distances using KL divergence. When applied genome-wide to currently available human telomere-to-telomere (T2T) genomes, this approach yields an accuracy ranking for the entire assembly and for each individual chromosome. Altogether, we present a rapid and robust scoring system, based on a numerical rendering of genomes as inter-motif distances, that provides a quantitative standard of assembly integrity in repetitive DNA regions and establishes a bona fide framework for chromosome-level genome-to-genome comparison.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript frames centromere assembly evaluation as a distribution comparison problem using a centeny representation of inter-motif genomic distances, then applies KL divergence between query and target distributions to produce accuracy rankings for entire T2T human genomes and individual chromosomes.
Significance. If shown to track technical assembly integrity rather than biological variation, the approach could supply an alignment-free quantitative standard for repetitive regions where conventional metrics fail.
major comments (2)
- [Abstract] Abstract: the central claim that the KL-divergence metric 'yields an accuracy ranking' for T2T assemblies rests on an untested assumption that inter-motif distance differences are dominated by assembly errors; the manuscript supplies no benchmark against documented centromeric misassemblies, simulated motif-order perturbations, or orthogonal quality signals (e.g., read-depth or Hi-C consistency).
- [Abstract] Abstract and framework description: because the compared T2T genomes derive from different individuals, any observed ranking could equally reflect haplotype-specific variation in repeat-array length or motif spacing; no analysis separates these sources, rendering the accuracy interpretation circular without external validation.
Simulated Author's Rebuttal
We thank the referee for these constructive comments on the manuscript. We address each major point below and indicate the revisions that will be incorporated.
- Referee: [Abstract] Abstract: the central claim that the KL-divergence metric 'yields an accuracy ranking' for T2T assemblies rests on an untested assumption that inter-motif distance differences are dominated by assembly errors; the manuscript supplies no benchmark against documented centromeric misassemblies, simulated motif-order perturbations, or orthogonal quality signals (e.g., read-depth or Hi-C consistency).
  Authors: We agree that direct validation is needed to support the accuracy interpretation. The manuscript currently presents the KL-divergence on centeny representations as a comparative metric without explicit tests against known errors. In revision we will add a dedicated validation subsection that includes (i) controlled simulations of motif-order swaps and array-length perturbations and (ii) comparison of the resulting KL scores against available read-depth and Hi-C consistency annotations for the same assemblies. These additions will quantify the metric's sensitivity to assembly artifacts.
  Revision: yes
- Referee: [Abstract] Abstract and framework description: because the compared T2T genomes derive from different individuals, any observed ranking could equally reflect haplotype-specific variation in repeat-array length or motif spacing; no analysis separates these sources, rendering the accuracy interpretation circular without external validation.
  Authors: This observation is correct and highlights a genuine interpretive limit. Because the assemblies come from distinct donors, observed differences in inter-motif distance distributions necessarily conflate technical assembly discrepancies with natural haplotype variation. We will revise the abstract and discussion to reframe the output explicitly as a measure of structural concordance with a chosen target assembly rather than an absolute accuracy ranking. We will also note that full separation of biological from technical sources would require same-individual multi-assembly datasets, which are not yet available for these centromeres.
  Revision: partial
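The array-length perturbation proposed in the first response could be prototyped along these lines. This is a toy sketch under stated assumptions: the `collapse` helper, the two-regime synthetic array, the bin width, and the pseudocount are all hypothetical choices made for illustration and do not reproduce the authors' planned simulations.

```python
import math
from collections import Counter

def distance_distribution(positions, bin_width=100, n_bins=10, pseudocount=1e-6):
    """Binned, normalised distribution of gaps between consecutive motifs."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    counts = Counter(min(g // bin_width, n_bins - 1) for g in gaps)
    raw = [counts.get(i, 0) + pseudocount for i in range(n_bins)]
    total = sum(raw)
    return [x / total for x in raw]

def kl_divergence(p, q):
    """Discrete KL divergence D(p || q) for strictly positive p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def collapse(positions, start, end):
    """Simulate a collapsed array: drop motifs in [start, end), shift the rest left."""
    span = end - start
    kept = [x for x in positions if x < start]
    kept += [x - span for x in positions if x >= end]
    return kept

# Synthetic target: 50 motifs spaced 150 bp, then 50 motifs spaced 350 bp.
target = [i * 150 for i in range(50)]
offset = target[-1]
target += [offset + (j + 1) * 350 for j in range(50)]

q = distance_distribution(target)
start = offset + 10 * 350  # collapse begins inside the 350 bp regime
scores = {
    units: kl_divergence(distance_distribution(collapse(target, start, start + units * 350)), q)
    for units in (0, 10, 30)  # number of 350 bp units removed
}
```

In this toy model the score grows with the size of the collapse because removing units changes the mix of spacings. A collapse of whole units inside a perfectly uniform array would leave the distance distribution, and therefore the score, unchanged, which is one case such a validation would need to probe.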
Circularity Check
No circularity detected; metric is a direct definition applied to data
The paper defines its centeny representation and KL-divergence metric directly from inter-motif genomic distances as a comparative distribution problem. Applying this defined metric to T2T assemblies produces rankings by construction of the definition itself, without any reduction of a claimed prediction to a fitted parameter, self-citation chain, or imported uniqueness theorem. No equations or steps in the provided text exhibit self-definitional loops, fitted inputs renamed as predictions, or ansatzes smuggled via citation. The framework is self-contained as an explicit numerical rendering and comparison method.