
arxiv: 2606.11276 · v1 · pith:N6KWIQJ5new · submitted 2026-06-09 · 🧬 q-bio.GN

A mathematical framework for centromere-aware evaluation of human genome assemblies

Pith reviewed 2026-06-27 10:36 UTC · model grok-4.3

classification 🧬 q-bio.GN
keywords centromere evaluation · genome assembly · KL divergence · inter-motif distances · T2T genomes · repetitive DNA · centeny representation · assembly benchmarking

The pith

Centromere quality in genome assemblies can be scored by comparing distributions of inter-motif distances with KL divergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames evaluation of centromeres in assembled genomes as a problem of comparing distributions of genomic distances between functional motifs rather than aligning nucleotide sequences. It encodes centromeres in a compact centeny representation and quantifies agreement between a query assembly and a target via KL divergence on those distance distributions. This matters because conventional alignment methods break down in highly repetitive and variable centromeric DNA, leaving no reliable way to benchmark assembly accuracy in these regions. When tested on current human telomere-to-telomere genomes, the method produces rankings for entire assemblies and for each chromosome separately. If the reduction holds, it supplies a rapid numerical score for how faithfully an assembly preserves centromere layout.

Core claim

Centromere assembly evaluation reduces to a comparative distribution problem in centeny representation: genomic distances between functional motifs are collected, the resulting distributions are compared by KL divergence, and the resulting scores rank both whole genomes and individual chromosomes among available human T2T assemblies.
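
For reference, the standard discrete KL divergence can be written for two binned inter-motif distance distributions as below. The symbols P (query), Q (target), and the bin index i are notational assumptions; the abstract does not state the binning scheme or which assembly plays which role.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}
```

The quantity is non-negative, equals zero only when P and Q coincide, and is not symmetric in its arguments. It is infinite whenever a bin has P(i) > 0 and Q(i) = 0, so any practical implementation needs smoothing or a shared support; how the paper handles this is not visible from the abstract.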

What carries the argument

Centeny representation, which converts centromeres into distributions of inter-motif genomic distances, with KL divergence used to measure agreement between query and target distributions.
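
The mechanics described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes motif positions are given as integer coordinates, that only distances between consecutive motifs are used, and that distributions are fixed-width histograms smoothed with a pseudocount. The function names, the `bin_width` of 500 bp, and the pseudocount are all hypothetical choices.

```python
import math
from collections import Counter

def inter_motif_distances(positions):
    """Distances between consecutive motif positions, after sorting."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def binned_distribution(distances, bin_width, bins, pseudocount=1e-6):
    """Histogram over the given bins, smoothed and normalised to sum to 1."""
    counts = Counter(d // bin_width for d in distances)
    raw = [counts.get(b, 0) + pseudocount for b in bins]
    total = sum(raw)
    return [x / total for x in raw]

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def centromere_score(query_positions, target_positions, bin_width=500):
    """KL divergence of query vs. target distance histograms (assumed setup)."""
    dq = inter_motif_distances(query_positions)
    dt = inter_motif_distances(target_positions)
    # Use the union of occupied bins so both histograms share one support.
    bins = sorted({d // bin_width for d in dq + dt})
    p = binned_distribution(dq, bin_width, bins)
    q = binned_distribution(dt, bin_width, bins)
    return kl_divergence(p, q)

# Made-up coordinates: the query has one interval stretched to 3,000 bp.
target = [0, 1000, 2000, 3000, 4000]
query = [0, 1000, 2000, 5000, 6000]
identical_score = centromere_score(target, target)
stretched_score = centromere_score(query, target)
```

In this sketch an identical query scores zero and the stretched query scores above zero. The absolute value depends strongly on the pseudocount and bin width, so scores are only comparable under fixed settings.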

If this is right

  • Available T2T human genomes receive an overall accuracy ranking based on centromere fidelity.
  • Each chromosome within those genomes receives its own centromere-specific accuracy score.
  • Repetitive centromeric regions become benchmarkable without reliance on sequence alignment.
  • A quantitative, rapid scoring system for assembly integrity in repetitive DNA is established.
  • Chromosome-level comparisons between genomes gain a distribution-based standard.
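
The two ranking levels listed above could be derived from per-chromosome scores along these lines. This is a sketch under assumptions: lower scores are taken to mean closer agreement with the target, and the genome-wide rank uses the mean over chromosomes, an aggregation the abstract does not specify. The assembly names and numbers are placeholders, not results from the paper.

```python
from statistics import mean

def rank_assemblies(scores):
    """Rank assemblies per chromosome and genome-wide.

    scores: {assembly: {chromosome: kl_score}}, lower meaning closer to target.
    """
    chromosomes = sorted({c for per_chr in scores.values() for c in per_chr})
    per_chromosome = {
        c: sorted((a for a in scores if c in scores[a]),
                  key=lambda a: scores[a][c])
        for c in chromosomes
    }
    # Assumed aggregation: mean of the per-chromosome scores.
    genome_wide = sorted(scores, key=lambda a: mean(scores[a].values()))
    return per_chromosome, genome_wide

# Placeholder values for illustration only.
toy = {
    "asm_A": {"chr1": 0.10, "chr2": 0.40},
    "asm_B": {"chr1": 0.30, "chr2": 0.05},
    "asm_C": {"chr1": 0.50, "chr2": 0.60},
}
per_chr, overall = rank_assemblies(toy)
```

With the placeholder values, `asm_A` leads on `chr1` while `asm_B` leads on `chr2` and overall, which shows how the per-chromosome and whole-genome rankings can disagree.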

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distance-distribution approach could be applied to other large repetitive elements such as telomeres.
  • The metric might be validated by checking whether its rankings predict centromere activity in cellular assays.
  • If motif spacing proves more diagnostic than sequence content, assembly algorithms could be redesigned to prioritize spacing fidelity.
  • The framework offers a route to compare centromeres across species once functional motifs are identified.

Load-bearing premise

Genomic distances between functional motifs in the centeny representation are sufficient to assess functional agreement between query and target centromeres without needing sequence-level information.

What would settle it

Experimental evidence that two centromeres with substantially different inter-motif distance distributions nevertheless produce equivalent kinetochore formation and chromosome segregation would falsify the metric.

Original abstract

Accurate evaluation of genome assemblies within highly repetitive regions, such as centromeres, remains a major open challenge in genomics. Conventional benchmarking relies on sequence alignment, which becomes problematic in regions of high homogeneity and divergence. Here, we framed centromere assembly evaluation as a comparative distribution problem in a compact centeny representation by computing genomic distances between functional motifs, rather than relying on nucleotide sequence. Our distribution-based metric assesses agreement between a query and a target chromosome by comparing their centromeric inter-motif distances rendered by KL divergence. When applied genome-wide to currently available human telomere-to-telomere (T2T) genomes, this approach yields an accuracy ranking for the entire assembly and for each individual chromosome. Altogether, we present a rapid and robust scoring system based on genomes numerical rendering of inter-motif distances, that provides a quantitative standard of assembly integrity in repetitive DNA regions and establishes a bona fide framework for chromosome-level genome-to-genome comparison.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript frames centromere assembly evaluation as a distribution comparison problem using a centeny representation of inter-motif genomic distances, then applies KL divergence between query and target distributions to produce accuracy rankings for entire T2T human genomes and individual chromosomes.

Significance. If shown to track technical assembly integrity rather than biological variation, the approach could supply an alignment-free quantitative standard for repetitive regions where conventional metrics fail.

major comments (2)
  1. [Abstract] The central claim that the KL-divergence metric 'yields an accuracy ranking' for T2T assemblies rests on an untested assumption that inter-motif distance differences are dominated by assembly errors; the manuscript supplies no benchmark against documented centromeric misassemblies, simulated motif-order perturbations, or orthogonal quality signals (e.g., read-depth or Hi-C consistency).
  2. [Abstract and framework description] Because the compared T2T genomes derive from different individuals, any observed ranking could equally reflect haplotype-specific variation in repeat-array length or motif spacing; no analysis separates these sources, rendering the accuracy interpretation circular without external validation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on the manuscript. We address each major point below and indicate the revisions that will be incorporated.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the KL-divergence metric 'yields an accuracy ranking' for T2T assemblies rests on an untested assumption that inter-motif distance differences are dominated by assembly errors; the manuscript supplies no benchmark against documented centromeric misassemblies, simulated motif-order perturbations, or orthogonal quality signals (e.g., read-depth or Hi-C consistency).

    Authors: We agree that direct validation is needed to support the accuracy interpretation. The manuscript currently presents the KL-divergence on centeny representations as a comparative metric without explicit tests against known errors. In revision we will add a dedicated validation subsection that includes (i) controlled simulations of motif-order swaps and array-length perturbations and (ii) comparison of the resulting KL scores against available read-depth and Hi-C consistency annotations for the same assemblies. These additions will quantify the metric's sensitivity to assembly artifacts.
    revision: yes

  2. Referee: [Abstract and framework description] Because the compared T2T genomes derive from different individuals, any observed ranking could equally reflect haplotype-specific variation in repeat-array length or motif spacing; no analysis separates these sources, rendering the accuracy interpretation circular without external validation.

    Authors: This observation is correct and highlights a genuine interpretive limit. Because the assemblies come from distinct donors, observed differences in inter-motif distance distributions necessarily conflate technical assembly discrepancies with natural haplotype variation. We will revise the abstract and discussion to reframe the output explicitly as a measure of structural concordance with a chosen target assembly rather than an absolute accuracy ranking. We will also note that full separation of biological from technical sources would require same-individual multi-assembly datasets, which are not yet available for these centromeres.
    revision: partial
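
The controlled simulations promised in the first response might look something like the sketch below. It is an illustration under assumptions, not the planned validation: the scoring is a histogram-based KL divergence over consecutive inter-motif distances, and both perturbations (`stretch_intervals`, `shuffle_intervals`), the interval lengths, and the 5,000 bp offset are hypothetical.

```python
import math
import random
from collections import Counter

def kl_from_distances(query, target, bin_width=500, pseudocount=1e-6):
    """D_KL(query || target) over shared distance bins, in nats."""
    bins = sorted({d // bin_width for d in query + target})

    def dist(ds):
        counts = Counter(d // bin_width for d in ds)
        raw = [counts.get(b, 0) + pseudocount for b in bins]
        total = sum(raw)
        return [x / total for x in raw]

    p, q = dist(query), dist(target)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def stretch_intervals(distances, k, extra_bp):
    """Hypothetical artifact: lengthen the first k intervals by extra_bp."""
    return [d + extra_bp if i < k else d for i, d in enumerate(distances)]

def shuffle_intervals(distances, seed=0):
    """Reorder the intervals without changing any of their lengths."""
    shuffled = list(distances)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Made-up target: 80 intervals of 1,000 bp and 20 of 2,000 bp.
target = [1000] * 80 + [2000] * 20

# Score as a function of how many intervals are perturbed.
scores = {k: kl_from_distances(stretch_intervals(target, k, 5000), target)
          for k in (0, 5, 20)}
shuffled_score = kl_from_distances(shuffle_intervals(target), target)
```

Under these assumptions the score rises as more intervals are stretched, while a pure reordering of intervals leaves it at zero because a histogram ignores order. Whether the paper's centeny representation shares that blind spot cannot be determined from the abstract; it may encode information beyond consecutive distances.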

Circularity Check

0 steps flagged

No circularity detected; metric is a direct definition applied to data

Full rationale

The paper defines its centeny representation and KL-divergence metric directly from inter-motif genomic distances as a comparative distribution problem. Applying this defined metric to T2T assemblies produces rankings by construction of the definition itself, without any reduction of a claimed prediction to a fitted parameter, self-citation chain, or imported uniqueness theorem. No equations or steps in the provided text exhibit self-definitional loops, fitted inputs renamed as predictions, or ansatzes smuggled via citation. The framework is self-contained as an explicit numerical rendering and comparison method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5720 in / 921 out tokens · 21464 ms · 2026-06-27T10:36:30.683335+00:00 · methodology


    Willard, H. F. and Waye, J. S. “Hierarchical order in chromosome-specific human alpha satellite DNA”. In:Trends in Genetics3 (1987), pp. 192–198. DOI:10.1016/0168-9525(87)90232-0. 21 rpe1v1.1,hg002(versions 0.7, 0.9, 1.0.1, and 1.1),yao(versions 1.1 and 2.0),i002c,chm13,h9,grch38, imr90,bj 22 (A)hg002centeny map (B)h9centeny map (C)rpe1centeny map (D)chm1...