Nonextensive Statistical Signatures of the Bilaterian Transition in Proteome Length Distributions
Pith reviewed 2026-06-29 02:31 UTC · model grok-4.3
The pith
The Tsallis entropic index q marks a shift in proteome length distributions at the bilaterian transition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Maximum likelihood fitting of truncated discrete q-exponential distributions to the complementary cumulative distribution functions of protein lengths in 22 reference proteomes identifies three distinct regimes for the Tsallis entropic index q: (1) values below 1 for prokaryotes, unicellular and non-animal multicellular eukaryotes, and basal animals; (2) bootstrap 95% confidence intervals spanning 1 for the two cnidarians studied and the basal bilaterian Capitella teleta; and (3) values above 1 for higher bilaterians, rising monotonically across the four sampled deuterostomes from 1.033 in Strongylocentrotus purpuratus to 1.147 in Homo sapiens.
What carries the argument
The Tsallis entropic index q from maximum-likelihood fits of the truncated discrete q-exponential to proteome-length CCDFs, which quantifies nonextensivity and tracks organizational complexity.
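For orientation, the quantities named here can be written out as follows. This is a sketch using the standard Tsallis q-exponential. The summary does not reproduce the paper's exact parameterization, so the scale parameter \lambda, the truncation bounds n_{\min} and n_{\max}, and the choice to write the model as a probability mass function over lengths n are assumed notation, not necessarily the authors' own.

```latex
e_q(x) = \left[\, 1 + (1-q)\,x \,\right]_{+}^{\frac{1}{1-q}},
\qquad \lim_{q \to 1} e_q(x) = e^{x}

P(n \mid q, \lambda) =
  \frac{e_q(-n/\lambda)}{\sum_{k=n_{\min}}^{n_{\max}} e_q(-k/\lambda)},
\qquad n = n_{\min}, \dots, n_{\max}
```

Under this assumed form, q > 1 gives a power-law tail that decays as n^{-1/(q-1)}, q < 1 gives a distribution that vanishes for n \geq \lambda/(1-q), and q = 1 recovers the ordinary exponential.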
If this is right
- The q-exponential outperforms the ordinary exponential distribution across all 22 proteomes.
- The relative performance of the q-exponential against other two-parameter distributions improves as proteome complexity increases.
- q increases monotonically across the four sampled deuterostomes from sea urchin to human.
- The boundary regime in q coincides with the phylogenetic position of cnidarians and the basal bilaterian C. teleta.
Where Pith is reading between the lines
- A broader survey of lophotrochozoan and ecdysozoan proteomes could test whether the boundary regime is a general feature of all basal bilaterians.
- If q indexes long-range correlations in protein lengths, it may correlate with independent measures such as cell-type diversity or regulatory network depth.
- The same fitting procedure applied to metagenomic protein-length data could place unsequenced lineages on the same q scale.
Load-bearing premise
The 22 reference proteomes supply an unbiased sample of the bilaterian transition zone and the truncated discrete q-exponential is the correct functional form whose q values can be compared directly across domains.
What would settle it
Sequencing additional proteomes from more cnidarians, basal bilaterians, and early deuterostomes and finding q values that fall outside the reported regime boundaries or reverse the monotonic increase would falsify the claimed transition pattern.
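The decision rule implied by this test can be sketched in a few lines of Python. The regime labels follow the abstract; the function names and the three example intervals are hypothetical and invented for illustration, since the summary quotes only the two endpoint point estimates (1.033 and 1.147) and no interval bounds.

```python
from typing import Sequence


def classify_regime(ci_low: float, ci_high: float) -> str:
    """Label a bootstrap CI on q by its position relative to q = 1."""
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if ci_high < 1.0:
        return "superextensive (q < 1)"
    if ci_low > 1.0:
        return "subextensive (q > 1)"
    return "boundary (CI spans 1)"


def is_strictly_increasing(values: Sequence[float]) -> bool:
    """True if every value exceeds the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# Hypothetical intervals, invented for illustration only.
print(classify_regime(0.90, 0.97))
print(classify_regime(0.98, 1.03))
print(classify_regime(1.10, 1.19))

# The two endpoint point estimates quoted in the summary.
print(is_strictly_increasing([1.033, 1.147]))
```

A new proteome would count against the claimed pattern if its label disagreed with its phylogenetic position, or if adding its q to the deuterostome sequence made `is_strictly_increasing` return False.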
Original abstract
Protein length distributions across the tree of life carry a quantitative signature of organismal complexity. Nonextensive statistical mechanics, through the Tsallis generalized entropy formalism, provides a natural framework for describing complex systems characterized by long-range correlations, scale invariance, and hierarchical organization -- features that classical Boltzmann-Gibbs statistics cannot accommodate. In this work, the complementary cumulative distribution function (CCDF) of protein lengths is analyzed within this framework for the reference proteomes of 22 fully sequenced organisms spanning the domains Archaea, Bacteria, and Eukarya, with deliberate sampling across the animal transition zone from sponges and cnidarians to higher bilaterians. Maximum likelihood (MLE) fitting of truncated discrete q-exponential distributions, with bootstrap 95% confidence intervals (CIs), reveals that the entropic index q resolves into three statistically distinct regimes: superextensive (q < 1) for prokaryotes, unicellular and non-animal multicellular eukaryotes, and basal animals; a boundary regime (CI on q spanning unity) for the two cnidarians studied and the basal bilaterian C. teleta; and subextensive (q > 1) for all higher bilaterians, with q increasing monotonically across the four deuterostomes sampled from S. purpuratus (1.033) to H. sapiens (1.147). The q-exponential outperforms the ordinary exponential distribution across all 22 proteomes and becomes progressively more competitive against alternative two-parameter distributions as proteome complexity increases. These results identify the Tsallis entropic index as a continuous, physically interpretable indicator of proteome organizational complexity and extend the applicability of nonextensive statistical mechanics to proteomic systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript fits truncated discrete q-exponential distributions via MLE (with bootstrap 95% CIs) to the complementary cumulative distribution functions of protein lengths in 22 reference proteomes. It reports that the fitted Tsallis index q partitions the organisms into three regimes—superextensive (q < 1) for prokaryotes plus unicellular/non-animal multicellular eukaryotes and basal animals, a boundary regime straddling q = 1 for two cnidarians and the basal bilaterian C. teleta, and subextensive (q > 1) for higher bilaterians with monotonic increase from S. purpuratus (1.033) to H. sapiens (1.147)—and claims that the q-exponential outperforms the ordinary exponential while becoming more competitive with other two-parameter heavy-tailed forms as complexity increases.
Significance. If the reported regime separation and monotonic trend prove robust, the work would supply a continuous, physically interpretable parameter linking nonextensive statistical mechanics to proteome organizational complexity across the bilaterian transition. The explicit use of bootstrap confidence intervals and direct model comparison against the exponential distribution are positive features; however, the deliberate rather than exhaustive sampling and absence of sensitivity checks on truncation and functional form limit the strength of the central claim.
major comments (3)
- [Abstract] Abstract: the tripartite regime structure and statistical separation rest on bootstrap 95% CIs from only three boundary-zone proteomes (two cnidarians + C. teleta). Because the headline partition is defined solely by whether these CIs lie below, straddle, or lie above q = 1, even modest changes in truncation cutoff or normalization could move one or more intervals across unity and erase the claimed structure.
- [Abstract] Abstract and implied Methods: no explicit statement is given of the precise truncation rules, length-exclusion criteria, or normalization convention used for the discrete q-exponential; these choices directly affect the MLE value of q and therefore the placement of the three boundary CIs that define the central claim.
- [Abstract] Abstract: the reported monotonic increase in q across the four deuterostomes is presented without any accompanying sensitivity table showing that the ordering or the CI placements survive substitution of an alternative two-parameter heavy-tailed model or variation of the lower truncation threshold.
minor comments (1)
- [Abstract] The abstract states that the q-exponential 'becomes progressively more competitive against alternative two-parameter distributions' but supplies no quantitative comparison (e.g., likelihood ratios or AIC differences) that would allow the reader to assess the strength of that statement.
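One form the requested quantitative comparison could take is an AIC difference between the q-exponential and the ordinary exponential. The sketch below is not the authors' code: it assumes the standard q-exponential on a truncated integer support, uses a crude grid search in place of a proper optimizer, and runs on synthetic lengths drawn from the model itself. All names (`fit_grid`, `aic`, the grids, the bounds 1 to 300) are hypothetical.

```python
import math
import random
from collections import Counter


def q_exp(x, q):
    """Tsallis q-exponential for x <= 0; reduces to exp(x) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0


def log_likelihood(counts, q, lam, lo, hi):
    """Log-likelihood of length counts under the truncated discrete form."""
    weights = [q_exp(-n / lam, q) for n in range(lo, hi + 1)]
    norm = sum(weights)
    if norm <= 0.0:
        return -math.inf
    log_norm = math.log(norm)
    total = 0.0
    for n, c in counts.items():
        w = weights[n - lo]
        if w <= 0.0:  # observed length lies outside the model's support
            return -math.inf
        total += c * (math.log(w) - log_norm)
    return total


def fit_grid(counts, lo, hi, q_grid, lam_grid):
    """Grid-search MLE; returns (max log-likelihood, q, lam)."""
    best = (-math.inf, None, None)
    for q in q_grid:
        for lam in lam_grid:
            ll = log_likelihood(counts, q, lam, lo, hi)
            if ll > best[0]:
                best = (ll, q, lam)
    return best


def aic(ll, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * ll


# Synthetic lengths drawn from the assumed model (not real proteome data).
LO, HI = 1, 300
rng = random.Random(0)
support = list(range(LO, HI + 1))
true_weights = [q_exp(-n / 40.0, 1.3) for n in support]
counts = Counter(rng.choices(support, weights=true_weights, k=3000))

q_grid = [round(0.80 + 0.05 * i, 2) for i in range(15)]  # 0.80 .. 1.50
lam_grid = [10.0 + 5.0 * i for i in range(29)]           # 10 .. 150

ll_q, q_hat, lam_hat = fit_grid(counts, LO, HI, q_grid, lam_grid)
ll_exp, _, lam_exp = fit_grid(counts, LO, HI, [1.0], lam_grid)

# Positive values favour the two-parameter q-exponential.
delta_aic = aic(ll_exp, 1) - aic(ll_q, 2)
print(q_hat, lam_hat, delta_aic)
```

Because q = 1 sits on the q grid, the exponential is nested in the search and the q-exponential log-likelihood can never be lower; the AIC penalty of two units per extra parameter is what makes the comparison informative. Reporting this difference per proteome would let a reader judge how the advantage changes with complexity.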
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of robustness and methodological transparency. We address each major comment below.
- Referee: [Abstract] Abstract: the tripartite regime structure and statistical separation rest on bootstrap 95% CIs from only three boundary-zone proteomes (two cnidarians + C. teleta). Because the headline partition is defined solely by whether these CIs lie below, straddle, or lie above q = 1, even modest changes in truncation cutoff or normalization could move one or more intervals across unity and erase the claimed structure.
  Authors: The tripartite partition is indeed anchored by the three boundary proteomes whose bootstrap CIs straddle q = 1, as these were deliberately sampled to probe the transition zone. The remaining 19 proteomes show CIs lying unambiguously below or above unity, and the monotonic rise in q among higher bilaterians provides supporting internal consistency. While we agree that the boundary placement is sensitive to modeling choices, the observed separation aligns with the biological sampling strategy across the bilaterian transition. We will add a brief discussion of this reliance and the rationale for focused sampling.
  revision: partial
- Referee: [Abstract] Abstract and implied Methods: no explicit statement is given of the precise truncation rules, length-exclusion criteria, or normalization convention used for the discrete q-exponential; these choices directly affect the MLE value of q and therefore the placement of the three boundary CIs that define the central claim.
  Authors: We accept that the Methods section lacks an explicit, self-contained description of the truncation rules, length-exclusion criteria, and normalization convention. These were implemented following standard practice for truncated discrete distributions, but the manuscript will be revised to include a dedicated paragraph detailing the exact procedures, including the lower cutoff selection and normalization.
  revision: yes
- Referee: [Abstract] Abstract: the reported monotonic increase in q across the four deuterostomes is presented without any accompanying sensitivity table showing that the ordering or the CI placements survive substitution of an alternative two-parameter heavy-tailed model or variation of the lower truncation threshold.
  Authors: The monotonic ordering is reported from the primary q-exponential fits. We will add a supplementary table that varies the lower truncation threshold for the four deuterostomes and recomputes the q values and CIs, confirming that the ordering is preserved. Full substitution of every alternative two-parameter model for the trend alone was not performed, as the primary model comparison already showed the q-exponential outperforming the exponential and becoming competitive with other heavy-tailed forms; however, the added table will address truncation sensitivity directly.
  revision: yes
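A truncation-sensitivity table of the kind promised in the last response could be produced along the following lines. This is a minimal sketch, not the authors' procedure: it assumes the standard q-exponential on an integer support, refits q by grid search for each lower cutoff, and uses synthetic lengths in place of proteome data. The function names, the cutoffs (1, 10, 20), and the upper bound of 300 are hypothetical.

```python
import math
import random
from collections import Counter


def q_exp(x, q):
    """Tsallis q-exponential for x <= 0; reduces to exp(x) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0


def fit_q(lengths, lo, hi, q_grid, lam_grid):
    """Grid-search MLE of q using only the lengths that fall in [lo, hi]."""
    counts = Counter(n for n in lengths if lo <= n <= hi)
    best_ll, best_q = -math.inf, None
    for q in q_grid:
        for lam in lam_grid:
            weights = [q_exp(-n / lam, q) for n in range(lo, hi + 1)]
            norm = sum(weights)
            if norm <= 0.0:  # model puts no mass on this window
                continue
            log_norm = math.log(norm)
            ll = 0.0
            for n, c in counts.items():
                w = weights[n - lo]
                if w <= 0.0:  # observed length outside the model's support
                    ll = -math.inf
                    break
                ll += c * (math.log(w) - log_norm)
            if ll > best_ll:
                best_ll, best_q = ll, q
    return best_q


def truncation_sensitivity(lengths, cutoffs, hi, q_grid, lam_grid):
    """Map each lower cutoff to the q refitted on the truncated sample."""
    return {lo: fit_q(lengths, lo, hi, q_grid, lam_grid) for lo in cutoffs}


# Synthetic lengths drawn from the assumed model (not real proteome data).
rng = random.Random(1)
support = list(range(1, 301))
true_weights = [q_exp(-n / 40.0, 1.3) for n in support]
lengths = rng.choices(support, weights=true_weights, k=3000)

q_grid = [round(0.80 + 0.05 * i, 2) for i in range(15)]  # 0.80 .. 1.50
lam_grid = [10.0 + 5.0 * i for i in range(29)]           # 10 .. 150

table = truncation_sensitivity(lengths, [1, 10, 20], 300, q_grid, lam_grid)
for lo, q_hat in table.items():
    print(lo, q_hat)
```

Running the same loop for each of the four deuterostomes, and attaching a bootstrap interval to every refit, would show directly whether the ordering and the position of each interval relative to q = 1 survive changes in the cutoff.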
Circularity Check
No circularity; q regimes are direct observational classification of MLE fits
The paper performs MLE fits of a truncated discrete q-exponential to each proteome's length CCDF and then classifies the resulting q values (with bootstrap CIs) into superextensive, boundary, and subextensive regimes. No equation in the provided text derives q from any other quantity, renames a fitted parameter as a prediction, or reduces the tripartite structure to a self-citation or ansatz; the reported monotonic trend among deuterostomes is likewise a direct reporting of the fitted sequence. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- q (Tsallis entropic index)
axioms (1)
- domain assumption: The complementary cumulative distribution function of protein lengths follows a truncated discrete q-exponential distribution.
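To make this assumption concrete, the sketch below builds a truncated discrete q-exponential and the CCDF it implies. It assumes the standard Tsallis form with a scale parameter `lam` and integer bounds `lo` and `hi`; these names, and the numerical values used (300, 50, 5000), are illustrative choices and do not come from the paper.

```python
import math
from typing import Dict


def q_exp(x: float, q: float) -> float:
    """Tsallis q-exponential for x <= 0; reduces to exp(x) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0


def truncated_pmf(q: float, lam: float, lo: int, hi: int) -> Dict[int, float]:
    """Probability of each integer length n in [lo, hi]."""
    weights = {n: q_exp(-n / lam, q) for n in range(lo, hi + 1)}
    norm = sum(weights.values())
    return {n: w / norm for n, w in weights.items()}


def ccdf(pmf: Dict[int, float]) -> Dict[int, float]:
    """P(L >= n) for every n in the support, accumulated from the tail."""
    out = {}
    running = 0.0
    for n in sorted(pmf, reverse=True):
        running += pmf[n]
        out[n] = running
    return out


# Compare the tail mass at n = 2000 for q = 1 and q = 1.1 (illustrative values).
tail_exp = ccdf(truncated_pmf(1.0, 300.0, 50, 5000))[2000]
tail_q = ccdf(truncated_pmf(1.1, 300.0, 50, 5000))[2000]
print(tail_exp, tail_q)
```

With the scale held fixed, raising q above 1 moves probability into the long-length tail, which is why a single fitted q can be read as a summary of tail weight. Whether q values are comparable across domains still depends on using the same truncation bounds and normalization for every proteome.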