
arxiv: 2606.27985 · v1 · pith:NJWM34AAnew · submitted 2026-06-26 · ❄️ cond-mat.stat-mech · physics.bio-ph

Nonextensive Statistical Signatures of the Bilaterian Transition in Proteome Length Distributions

Pith reviewed 2026-06-29 02:31 UTC · model grok-4.3

classification ❄️ cond-mat.stat-mech physics.bio-ph
keywords proteome length distributions · Tsallis entropy · nonextensive statistical mechanics · bilaterian transition · q-exponential distribution · protein lengths · evolutionary complexity · statistical signatures

The pith

The Tsallis entropic index q marks a shift in proteome length distributions at the bilaterian transition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper fits truncated discrete q-exponential distributions to the complementary cumulative distribution functions of protein lengths from 22 reference proteomes. It reports that the resulting entropic index q falls into three regimes: values below 1 for prokaryotes, unicellular and non-animal multicellular eukaryotes, and basal animals; confidence intervals spanning 1 for the two cnidarians and the basal bilaterian C. teleta; and values above 1 for higher bilaterians, rising from 1.033 to 1.147 across the four deuterostomes sampled. The q-exponential fits better than the ordinary exponential in all 22 proteomes and becomes more competitive against other two-parameter forms as complexity rises. A sympathetic reader would care because the index supplies a single, physically motivated number that tracks the rise in hierarchical proteome organization during the origin of bilaterian animals.

Core claim

Maximum likelihood fitting of truncated discrete q-exponential distributions to the complementary cumulative distribution functions of protein lengths in reference proteomes identifies three distinct regimes for the Tsallis entropic index q: values below 1 for prokaryotes, unicellular and non-animal multicellular eukaryotes, and basal animals; confidence intervals spanning 1 for the cnidarians and basal bilaterian Capitella teleta; and values above 1 for higher bilaterians, monotonically increasing from 1.033 in Strongylocentrotus purpuratus to 1.147 in Homo sapiens.

What carries the argument

The Tsallis entropic index q from maximum-likelihood fits of the truncated discrete q-exponential to proteome-length CCDFs, which quantifies nonextensivity and tracks organizational complexity.
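
For orientation, the standard Tsallis q-exponential and one plausible form of a truncated discrete q-exponential are written out below. The page does not give the paper's truncation or normalization convention, so the scale $\lambda$ and the support bounds $\ell_{\min}$, $\ell_{\max}$ are assumed notation rather than the author's.

```latex
\begin{align}
  e_q(x) &= \bigl[\,1 + (1-q)\,x\,\bigr]_{+}^{\frac{1}{1-q}},
  \qquad \lim_{q \to 1} e_q(x) = e^{x}, \\
  p(\ell \mid q, \lambda) &=
  \frac{e_q(-\ell/\lambda)}{\sum_{k=\ell_{\min}}^{\ell_{\max}} e_q(-k/\lambda)},
  \qquad \ell = \ell_{\min}, \dots, \ell_{\max}, \\
  \bar{F}(\ell) &= \sum_{k=\ell}^{\ell_{\max}} p(k \mid q, \lambda).
\end{align}
```

Under this form, q > 1 gives a power-law tail, $e_q(-\ell/\lambda) \sim \ell^{-1/(q-1)}$ for large $\ell$; q < 1 gives a finite cutoff at $\ell = \lambda/(1-q)$, where the bracket vanishes; and q = 1 recovers the ordinary exponential.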

If this is right

  • The q-exponential outperforms the ordinary exponential distribution across all 22 proteomes.
  • The relative performance of the q-exponential against other two-parameter distributions improves as proteome complexity increases.
  • q increases monotonically across the four sampled deuterostomes from sea urchin to human.
  • The boundary regime in q coincides with the phylogenetic position of cnidarians and the basal bilaterian C. teleta.
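
The first bullet compares a q-exponential fit with an exponential fit, and that comparison can be sketched as follows. This is a minimal illustration, not the author's code: it assumes the pmf is proportional to e_q(-l/lambda) on a truncated integer support, swaps in a coarse grid search for a proper optimizer, and uses synthetic lengths drawn from the model itself. The names `log_eq`, `log_likelihood`, `fit_grid`, and all numeric settings are hypothetical.

```python
import math
import random
from collections import Counter


def log_eq(x, q):
    """Log of the q-exponential e_q(x) = [1 + (1 - q) x]^(1 / (1 - q))."""
    if abs(q - 1.0) < 1e-9:
        return x  # q -> 1 limit: ordinary exponential
    base = 1.0 + (1.0 - q) * x
    if base <= 0.0:
        return float("-inf")  # outside the support (cutoff that appears for q < 1)
    return math.log(base) / (1.0 - q)


def log_likelihood(counts, q, lam, l_min, l_max):
    """Log-likelihood of a Counter of lengths, all assumed to lie in l_min..l_max."""
    logs = [log_eq(-l / lam, q) for l in range(l_min, l_max + 1)]
    top = max(logs)
    if top == float("-inf"):
        return float("-inf")
    # Log-normaliser over the whole truncated support, via log-sum-exp.
    log_z = top + math.log(sum(math.exp(v - top) for v in logs))
    return sum(n * (logs[l - l_min] - log_z) for l, n in counts.items())


def fit_grid(counts, l_min, l_max, q_grid, lam_grid):
    """Coarse grid-search MLE; returns (max log-likelihood, q_hat, lam_hat)."""
    best = (float("-inf"), None, None)
    for q in q_grid:
        for lam in lam_grid:
            ll = log_likelihood(counts, q, lam, l_min, l_max)
            if ll > best[0]:
                best = (ll, q, lam)
    return best


# Synthetic "proteome": 4000 lengths drawn from the model itself (not real data).
L_MIN, L_MAX = 30, 3000
support = list(range(L_MIN, L_MAX + 1))
weights = [math.exp(log_eq(-l / 300.0, 1.3)) for l in support]
counts = Counter(random.Random(0).choices(support, weights=weights, k=4000))

q_grid = [round(0.70 + 0.05 * i, 2) for i in range(19)]  # 0.70 .. 1.60
lam_grid = [100.0 + 50.0 * i for i in range(19)]         # 100 .. 1000

ll_q, q_hat, lam_hat = fit_grid(counts, L_MIN, L_MAX, q_grid, lam_grid)
ll_exp, _, lam_exp = fit_grid(counts, L_MIN, L_MAX, [1.0], lam_grid)
print(f"q-exponential: q={q_hat:.2f}, lambda={lam_hat:.0f}, logL={ll_q:.1f}")
print(f"exponential:   lambda={lam_exp:.0f}, logL={ll_exp:.1f}")
```

Because q = 1 sits on the grid, the q-exponential's maximized log-likelihood can never fall below the exponential's in this sketch. That nesting is why "outperforms the exponential" needs a penalty for the extra parameter before it says much.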

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A broader survey of lophotrochozoan and ecdysozoan proteomes could test whether the boundary regime is a general feature of all basal bilaterians.
  • If q indexes long-range correlations in protein lengths, it may correlate with independent measures such as cell-type diversity or regulatory network depth.
  • The same fitting procedure applied to metagenomic protein-length data could place unsequenced lineages on the same q scale.

Load-bearing premise

The 22 reference proteomes supply an unbiased sample of the bilaterian transition zone and the truncated discrete q-exponential is the correct functional form whose q values can be compared directly across domains.

What would settle it

Sequencing additional proteomes from more cnidarians, basal bilaterians, and early deuterostomes and finding q values that fall outside the reported regime boundaries or reverse the monotonic increase would falsify the claimed transition pattern.

Original abstract

Protein length distributions across the tree of life carry a quantitative signature of organismal complexity. Nonextensive statistical mechanics, through the Tsallis generalized entropy formalism, provides a natural framework for describing complex systems characterized by long-range correlations, scale invariance, and hierarchical organization -- features that classical Boltzmann-Gibbs statistics cannot accommodate. In this work, the complementary cumulative distribution function (CCDF) of protein lengths is analyzed within this framework for the reference proteomes of 22 fully sequenced organisms spanning the domains Archaea, Bacteria, and Eukarya, with deliberate sampling across the animal transition zone from sponges and cnidarians to higher bilaterians. Maximum likelihood (MLE) fitting of truncated discrete q-exponential distributions, with bootstrap 95% confidence intervals (CIs), reveals that the entropic index q resolves into three statistically distinct regimes: superextensive (q < 1) for prokaryotes, unicellular and non-animal multicellular eukaryotes, and basal animals; a boundary regime (CI on q spanning unity) for the two cnidarians studied and the basal bilaterian C. teleta; and subextensive (q > 1) for all higher bilaterians, with q increasing monotonically across the four deuterostomes sampled from S. purpuratus (1.033) to H. sapiens (1.147). The q-exponential outperforms the ordinary exponential distribution across all 22 proteomes and becomes progressively more competitive against alternative two-parameter distributions as proteome complexity increases. These results identify the Tsallis entropic index as a continuous, physically interpretable indicator of proteome organizational complexity and extend the applicability of nonextensive statistical mechanics to proteomic systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript fits truncated discrete q-exponential distributions via MLE (with bootstrap 95% CIs) to the complementary cumulative distribution functions of protein lengths in 22 reference proteomes. It reports that the fitted Tsallis index q partitions the organisms into three regimes—superextensive (q < 1) for prokaryotes plus unicellular/non-animal multicellular eukaryotes and basal animals, a boundary regime straddling q = 1 for two cnidarians and the basal bilaterian C. teleta, and subextensive (q > 1) for higher bilaterians with monotonic increase from S. purpuratus (1.033) to H. sapiens (1.147)—and claims that the q-exponential outperforms the ordinary exponential while becoming more competitive with other two-parameter heavy-tailed forms as complexity increases.

Significance. If the reported regime separation and monotonic trend prove robust, the work would supply a continuous, physically interpretable parameter linking nonextensive statistical mechanics to proteome organizational complexity across the bilaterian transition. The explicit use of bootstrap confidence intervals and direct model comparison against the exponential distribution are positive features; however, the deliberate rather than exhaustive sampling and absence of sensitivity checks on truncation and functional form limit the strength of the central claim.

major comments (3)
  1. [Abstract] Abstract: the tripartite regime structure and statistical separation rest on bootstrap 95% CIs from only three boundary-zone proteomes (two cnidarians + C. teleta). Because the headline partition is defined solely by whether these CIs lie below, straddle, or lie above q = 1, even modest changes in truncation cutoff or normalization could move one or more intervals across unity and erase the claimed structure.
  2. [Abstract] Abstract and implied Methods: no explicit statement is given of the precise truncation rules, length-exclusion criteria, or normalization convention used for the discrete q-exponential; these choices directly affect the MLE value of q and therefore the placement of the three boundary CIs that define the central claim.
  3. [Abstract] Abstract: the reported monotonic increase in q across the four deuterostomes is presented without any accompanying sensitivity table showing that the ordering or the CI placements survive substitution of an alternative two-parameter heavy-tailed model or variation of the lower truncation threshold.
minor comments (1)
  1. [Abstract] The abstract states that the q-exponential 'becomes progressively more competitive against alternative two-parameter distributions' but supplies no quantitative comparison (e.g., likelihood ratios or AIC differences) that would allow the reader to assess the strength of that statement.
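
One way to supply the quantitative comparison the minor comment asks for is an information-criterion table. The sketch below computes AIC, the difference from the best model, and Akaike weights from maximized log-likelihoods. The page does not say which statistic the paper reports, and the model names and log-likelihood values here are invented placeholders, not results from the paper.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_lik


def akaike_table(fits):
    """fits maps model name -> (max log-likelihood, number of free parameters).

    Returns model name -> (AIC, delta AIC from the best model, Akaike weight).
    """
    scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
    best = min(scores.values())
    rel = {name: math.exp(-0.5 * (a - best)) for name, a in scores.items()}
    norm = sum(rel.values())
    return {name: (scores[name], scores[name] - best, rel[name] / norm)
            for name in scores}


# Placeholder log-likelihoods for one hypothetical proteome (not from the paper).
example_fits = {
    "exponential": (-15230.4, 1),
    "q-exponential": (-15102.7, 2),
    "lognormal": (-15099.1, 2),
}
table = akaike_table(example_fits)
for name, (a, delta, weight) in sorted(table.items(), key=lambda kv: kv[1][1]):
    print(f"{name:14s} AIC={a:10.1f}  dAIC={delta:7.1f}  weight={weight:.3f}")
```

Reporting the per-proteome difference between the q-exponential and the best competitor, ordered along the complexity axis, would let a reader check the "progressively more competitive" statement directly.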

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of robustness and methodological transparency. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the tripartite regime structure and statistical separation rest on bootstrap 95% CIs from only three boundary-zone proteomes (two cnidarians + C. teleta). Because the headline partition is defined solely by whether these CIs lie below, straddle, or lie above q = 1, even modest changes in truncation cutoff or normalization could move one or more intervals across unity and erase the claimed structure.

    Authors: The tripartite partition is indeed anchored by the three boundary proteomes whose bootstrap CIs straddle q=1, as these were deliberately sampled to probe the transition zone. The remaining 19 proteomes show CIs lying unambiguously below or above unity, and the monotonic rise in q among higher bilaterians provides supporting internal consistency. While we agree that the boundary placement is sensitive to modeling choices, the observed separation aligns with the biological sampling strategy across the bilaterian transition. We will add a brief discussion of this reliance and the rationale for focused sampling. revision: partial

  2. Referee: [Abstract] Abstract and implied Methods: no explicit statement is given of the precise truncation rules, length-exclusion criteria, or normalization convention used for the discrete q-exponential; these choices directly affect the MLE value of q and therefore the placement of the three boundary CIs that define the central claim.

    Authors: We accept that the Methods section lacks an explicit, self-contained description of the truncation rules, length-exclusion criteria, and normalization convention. These were implemented following standard practice for truncated discrete distributions, but the manuscript will be revised to include a dedicated paragraph detailing the exact procedures, including the lower cutoff selection and normalization. revision: yes

  3. Referee: [Abstract] Abstract: the reported monotonic increase in q across the four deuterostomes is presented without any accompanying sensitivity table showing that the ordering or the CI placements survive substitution of an alternative two-parameter heavy-tailed model or variation of the lower truncation threshold.

    Authors: The monotonic ordering is reported from the primary q-exponential fits. We will add a supplementary table that varies the lower truncation threshold for the four deuterostomes and recomputes the q values and CIs, confirming that the ordering is preserved. Full substitution of every alternative two-parameter model for the trend alone was not performed, as the primary model comparison already showed the q-exponential outperforming the exponential and becoming competitive with other heavy-tailed forms; however, the added table will address truncation sensitivity directly. revision: yes
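
A truncation-sensitivity table of the kind promised in the third response could be produced along these lines. This is a sketch under stated assumptions, not the authors' procedure: the pmf is taken as proportional to e_q(-l/lambda) on l_min..l_max, a coarse grid search stands in for a proper optimizer, the data are synthetic, and `fit_q`, `truncation_sweep`, and the cutoff values are hypothetical.

```python
import math
import random
from collections import Counter


def log_eq(x, q):
    """Log of the q-exponential; -inf where the bracket is non-positive."""
    if abs(q - 1.0) < 1e-9:
        return x
    base = 1.0 + (1.0 - q) * x
    return math.log(base) / (1.0 - q) if base > 0.0 else float("-inf")


def fit_q(lengths, l_min, l_max, q_grid, lam_grid):
    """Grid-search MLE of (q, lambda) using only lengths inside [l_min, l_max]."""
    counts = Counter(l for l in lengths if l_min <= l <= l_max)
    best = (float("-inf"), None, None)
    for q in q_grid:
        for lam in lam_grid:
            logs = [log_eq(-l / lam, q) for l in range(l_min, l_max + 1)]
            top = max(logs)
            if top == float("-inf"):
                continue
            log_z = top + math.log(sum(math.exp(v - top) for v in logs))
            ll = sum(n * (logs[l - l_min] - log_z) for l, n in counts.items())
            if ll > best[0]:
                best = (ll, q, lam)
    return best[1], best[2], sum(counts.values())


def truncation_sweep(lengths, cutoffs, l_max, q_grid, lam_grid):
    """Refit at each lower cutoff; rows are (l_min, n_used, q_hat, lam_hat)."""
    rows = []
    for l_min in cutoffs:
        q_hat, lam_hat, n_used = fit_q(lengths, l_min, l_max, q_grid, lam_grid)
        rows.append((l_min, n_used, q_hat, lam_hat))
    return rows


# Synthetic lengths drawn from the model with q = 1.3, lambda = 300 (not real data).
support = list(range(30, 1501))
weights = [math.exp(log_eq(-l / 300.0, 1.3)) for l in support]
lengths = random.Random(1).choices(support, weights=weights, k=2000)

q_grid = [round(0.8 + 0.1 * i, 1) for i in range(9)]  # 0.8 .. 1.6
lam_grid = [100.0 * i for i in range(1, 11)]          # 100 .. 1000
rows = truncation_sweep(lengths, [30, 50, 100], 1500, q_grid, lam_grid)
for l_min, n_used, q_hat, lam_hat in rows:
    print(f"l_min={l_min:4d}  n={n_used:5d}  q_hat={q_hat:.1f}  lambda_hat={lam_hat:.0f}")
```

For the referee's concern, the quantity to watch is whether the ordering of q across organisms, and the side of 1 on which each interval falls, changes from row to row.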

Circularity Check

0 steps flagged

No circularity; q regimes are direct observational classification of MLE fits

Full rationale

The paper performs MLE fits of a truncated discrete q-exponential to each proteome's length CCDF and then classifies the resulting q values (with bootstrap CIs) into superextensive, boundary, and subextensive regimes. No equation in the provided text derives q from any other quantity, renames a fitted parameter as a prediction, or reduces the tripartite structure to a self-citation or ansatz; the reported monotonic trend among deuterostomes is likewise a direct reporting of the fitted sequence. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.
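
The classification step described above reduces to two operations: a bootstrap interval on q and a check of where that interval sits relative to 1. The sketch below shows a percentile bootstrap and the three-way rule. The percentile method is an assumption, since the page only says "bootstrap 95% CIs". The estimator in the demo is a cheap stand-in, and the example intervals are hypothetical values chosen to exercise each branch.

```python
import math
import random
import statistics


def bootstrap_ci(data, estimator, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for estimator(data), resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    alpha = (1.0 - level) / 2.0
    lo = reps[math.floor(alpha * (n_boot - 1))]
    hi = reps[math.ceil((1.0 - alpha) * (n_boot - 1))]
    return lo, hi


def classify_regime(ci_lo, ci_hi):
    """Map a CI on q to the three regimes named in the abstract."""
    if ci_hi < 1.0:
        return "superextensive (q < 1)"
    if ci_lo > 1.0:
        return "subextensive (q > 1)"
    return "boundary (CI spans 1)"


# Stand-in estimator so the example runs quickly; in the paper's setting the
# estimator would be the MLE of q refitted on each resampled proteome.
toy_data = list(range(1, 101))
print(bootstrap_ci(toy_data, statistics.fmean, n_boot=500))

# Hypothetical intervals, not values from the paper.
for ci in [(0.91, 0.96), (0.98, 1.03), (1.12, 1.17)]:
    print(ci, "->", classify_regime(*ci))
```

The rule makes the referee's first point concrete: a proteome changes regime as soon as one endpoint of its interval crosses 1, so the boundary class is only as stable as those endpoints.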

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that protein length CCDFs are well-described by truncated discrete q-exponentials and that the 22 sampled proteomes represent the evolutionary transition without selection bias.

free parameters (1)
  • q (Tsallis entropic index)
    Fitted by maximum likelihood to each proteome's CCDF; the reported regime distinctions and monotonic trend are defined by the values of this fitted parameter.
axioms (1)
  • Domain assumption: the complementary cumulative distribution function of protein lengths follows a truncated discrete q-exponential distribution.
    Invoked as the model form for all MLE fits; the statistical separation of regimes depends on this functional choice.
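
The axiom can be checked visually or numerically by putting an empirical CCDF next to the model CCDF. The sketch below computes both. It assumes the model pmf is proportional to e_q(-l/lambda) on l_min..l_max, which is one reading of "truncated discrete q-exponential" and is not confirmed by the page. The function names and parameter values are hypothetical.

```python
from collections import Counter


def q_exp(x, q):
    """q-exponential e_q(x); zero where the bracket is non-positive."""
    if abs(q - 1.0) < 1e-9:
        import math
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0


def empirical_ccdf(lengths):
    """Sorted (length, P(L >= length)) pairs from observed integer lengths."""
    counts = Counter(lengths)
    n = len(lengths)
    remaining = n
    out = []
    for l in sorted(counts):
        out.append((l, remaining / n))
        remaining -= counts[l]
    return out


def model_ccdf(q, lam, l_min, l_max):
    """CCDF of the truncated discrete model as a dict length -> P(L >= length)."""
    w = [q_exp(-l / lam, q) for l in range(l_min, l_max + 1)]
    z = sum(w)
    ccdf, tail = {}, 0.0
    for l in range(l_max, l_min - 1, -1):  # accumulate the tail from the right
        tail += w[l - l_min] / z
        ccdf[l] = tail
    return ccdf


print(empirical_ccdf([3, 3, 5, 8]))
model = model_ccdf(q=1.2, lam=300.0, l_min=30, l_max=2000)
print(model[30], model[500], model[2000])
```

Plotting both curves on log-log axes is the usual way to see whether the tail bends like a power law (q > 1) or cuts off (q < 1).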


