pith. sign in

arxiv: 2012.00742 · v3 · submitted 2020-12-01 · 🧮 math.PR · math.CO· math.ST· stat.TH

Spectral Analysis of Word Statistics

Pith reviewed 2026-05-24 13:12 UTC · model grok-4.3

classification 🧮 math.PR math.COmath.STstat.TH
keywords word statisticssubsequence countsspectral decompositioncovariance matrixrandom textsasymptotic structurealgebraic operators
0
0 comments X

The pith

Word subsequence counts in random texts over finite alphabets have their covariance matrices decomposed into explicit eigenvectors and eigenvalues by order.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that counts of fixed-length words as subsequences in a growing random text over a finite alphabet have a joint distribution with a layered asymptotic structure. It characterizes all linear combinations of these statistics according to their orders of magnitude. The main result is the spectral decomposition of the space of word statistics for each order, together with explicit formulas for the eigenvectors and eigenvalues of the associated covariance matrices. This algebraic structure unifies several classical combinatorial examples and statistical tests within one framework.

Core claim

We establish the spectral decomposition of the space of word statistics of each order. We provide explicit formulas for the eigenvectors and eigenvalues of the covariance matrix of the multivariate distribution of these statistics.

What carries the argument

Algebraic word operators that decompose the space of word statistics of each order into eigenspaces with explicit covariance eigenvectors and eigenvalues.

If this is right

  • Special cases such as intransitive dice and random core partitions are described within the same unified framework.
  • Several classical statistical tests receive a common structural description.
  • The decomposition opens further applications to data analysis and machine learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit spectrum may simplify higher-order moment calculations in large sequence datasets.
  • Analogous decompositions could be sought for counting statistics in other discrete probability models.
  • The approach may yield closed-form expressions for certain random walk functionals.

Load-bearing premise

The joint distribution of word subsequence counts in a random text over a finite alphabet admits an asymptotic structure that allows algebraic decomposition as text length grows.

What would settle it

Generate many random texts over a small alphabet, compute the sample covariance matrix of word subsequence counts for fixed lengths, and check whether its eigenvectors and eigenvalues match the paper's explicit formulas.

read the original abstract

Given a random text over a finite alphabet, we study the frequencies at which fixed-length words occur as subsequences. As the data size grows, the joint distribution of word counts exhibits a rich asymptotic structure. We investigate all linear combinations of subword statistics, and fully characterize their different orders of magnitude using diverse algebraic tools. Moreover, we establish the spectral decomposition of the space of word statistics of each order. We provide explicit formulas for the eigenvectors and eigenvalues of the covariance matrix of the multivariate distribution of these statistics. Our techniques include and elaborate on a set of algebraic word operators, recently studied and employed by Dieker and Saliola (Adv Math, 2018). Subword counts find applications in Combinatorics, Statistics, and Computer Science. We revisit special cases from the combinatorial literature, such as intransitive dice, random core partitions, and questions on random walk. Our structural approach describes in a unified framework several classical statistical tests. We propose further potential applications to data analysis and machine learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper studies the asymptotic joint distribution of subsequence counts for fixed-length words in random texts over a finite alphabet. It claims to fully characterize the different orders of magnitude of all linear combinations of these subword statistics via algebraic tools, and to establish the spectral decomposition of the space of word statistics of each order, including explicit formulas for the eigenvectors and eigenvalues of the covariance matrix of the multivariate distribution. The approach elaborates on algebraic word operators from Dieker and Saliola (2018) and applies the framework to classical problems such as intransitive dice, random core partitions, and random walks, while proposing uses in data analysis and machine learning.

Significance. If the explicit spectral formulas and decomposition hold, the work supplies a unified algebraic structure for analyzing dependencies among subsequence statistics, which could streamline several classical statistical tests and combinatorial enumerations. The provision of closed-form eigenvectors/eigenvalues for the limiting covariance at each order is a concrete advance over purely asymptotic or simulation-based approaches in the area.

minor comments (4)
  1. §2: The definition of the word operators and their action on the space of statistics would benefit from an explicit small-alphabet example (e.g., binary alphabet, length-2 words) to illustrate the claimed commutation relations before the general spectral theorem is stated.
  2. §4.2, Theorem 4.3: The statement that the eigenvectors are 'parameter-free' should be qualified by noting the dependence on the underlying letter probabilities; the current wording risks suggesting independence from the text model.
  3. The applications to intransitive dice and random core partitions are sketched but lack a self-contained verification that the general spectral formulas recover the known variances or correlations in those special cases; adding one short calculation per example would strengthen the unified-framework claim.
  4. Notation: The distinction between the finite-n covariance and its limiting form is occasionally blurred in the text; consistent use of subscripts (e.g., Σ_n vs. Σ) would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript, the accurate summary of its contributions, and the recommendation for minor revision. No major comments appear in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper derives explicit eigenvector and eigenvalue formulas for the covariance of word-subsequence statistics by elaborating algebraic operators from the external reference Dieker and Saliola (Adv. Math. 2018). No step reduces a claimed result to a fitted parameter, self-definition, or self-citation chain; the central spectral decomposition is obtained via algebraic manipulation of the cited operators applied to the subsequence-count model, which is independent of the target formulas. The analysis is therefore not circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the work relies on standard domain assumptions about random text generation but introduces no free parameters or invented entities; it extends prior algebraic operators without adding new postulates.

axioms (1)
  • domain assumption Random text generated over a finite alphabet yields subsequence word counts with well-defined asymptotic joint distributions
    This modeling choice underpins the entire asymptotic and spectral analysis.

pith-pipeline@v0.9.0 · 5708 in / 1223 out tokens · 33124 ms · 2026-05-24T13:12:44.232459+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 4 internal anchors

  1. [1]

    On the distribution of the two-sample C ram \'e r-von M ises criterion

    Theodore W Anderson. On the distribution of the two-sample C ram \'e r-von M ises criterion. The Annals of Mathematical Statistics , pages 1148--1159, 1962

  2. [2]

    Partitions which are simultaneously t_1 -and t_2 -core

    Jaclyn Anderson. Partitions which are simultaneously t_1 -and t_2 -core. Discrete Mathematics , 248(1-3):237--243, 2002

  3. [3]

    Intransitive dice

    Brian Conrey, James Gabbard, Katie Grant, Andrew Liu, and Kent E Morrison. Intransitive dice. Mathematics Magazine , 89(2):133--143, 2016

  4. [4]

    Integration of paths -- a faithful representation of paths by noncommutative formal power series

    Kuo-Tsai Chen. Integration of paths -- a faithful representation of paths by noncommutative formal power series. Transactions of the American Mathematical Society , 89(2):395--407, 1958

  5. [5]

    A primer on the signature method in machine learning

    Ilya Chevyrev and Andrey Kormilitzin. A primer on the signature method in machine learning. arXiv preprint arXiv:1603.03788 , 2016

  6. [6]

    Group representations in probability and statistics

    Persi Diaconis. Group representations in probability and statistics. Lecture notes -- monograph series , 11:i--192, 1988

  7. [7]

    Spectral analysis of random-to-random markov chains

    Anton B Dieker and Franco V Saliola. Spectral analysis of random-to-random markov chains. Advances in Mathematics , 323:427--485, 2018

  8. [8]

    Orthogonal expansions and U -statistics

    GK Eagleson. Orthogonal expansions and U -statistics. Australian Journal of Statistics , 21(3):221--237, 1979

  9. [9]

    Invariants of random knots and links

    Chaim Even -Zohar , Joel Hass, Nati Linial, and Tahl Nowik. Invariants of random knots and links. Discrete & Computational Geometry , 56(2):274--314, 2016

  10. [10]

    Patterns in random permutations

    Chaim Even -Zohar . Patterns in random permutations. Combinatorica , pages 1--30, 2020

  11. [11]

    Sizes of simultaneous core partitions

    Chaim Even -Zohar . Sizes of simultaneous core partitions. arXiv preprint arXiv:2003.13671 , 2020

  12. [12]

    Explicit Expressions for the Variance and Higher Moments of the Size of a Simultaneous Core Partition and its Limiting Distribution

    Shalosh B Ekhad and Doron Zeilberger. Explicit expressions for the variance and higher moments of the size of a simultaneous core partition and its limiting distribution. arXiv preprint arXiv:1508.07637 , 2015

  13. [13]

    A Treatise on Sucker's Bets

    Shalosh B Ekhad and Doron Zeilberger. A treatise on sucker's bets. arXiv preprint arXiv:1710.10344 , 2017

  14. [14]

    Representation theory: a first course , volume 129

    William Fulton and Joe Harris. Representation theory: a first course , volume 129. Springer Science & Business Media, 2013

  15. [15]

    Martin Gardner. The colossal book of mathematics: classic puzzles, paradoxes, and problems: number theory, algebra, geometry, probability, topology, game theory, infinity, and other topics of recreational mathematics . WW Norton & Company, 2001

  16. [16]

    Marches al \'e atoires sur les groupes de L ie , volume 624

    Yves Guivarc'h, Michael Keane, and Bernard Roynette. Marches al \'e atoires sur les groupes de L ie , volume 624. Springer, 1977

  17. [17]

    The probability of intransitivity in dice and close elections

    Jan H a z a, Elchanan Mossel, Nathan Ross, and Guangqu Zheng. The probability of intransitivity in dice and close elections. Probability Theory and Related Fields , pages 1--59, 2020

  18. [18]

    A class of statistics with asymptotically normal distribution

    Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics , 19(3):293--325, 1948

  19. [19]

    Gaussian H ilbert spaces , volume 129

    Svante Janson. Gaussian H ilbert spaces , volume 129. Cambridge university press, 1997

  20. [20]

    Renewal theory for asymmetric U -statistics

    Svante Janson. Renewal theory for asymmetric U -statistics. Electronic Journal of Probability , 23, 2018

  21. [21]

    Random graphs , volume 45

    Svante Janson, Tomasz Luczak, and Andrzej Rucinski. Random graphs , volume 45. John Wiley & Sons, 2011

  22. [22]

    The asymptotic distributions of generalized U -statistics with applications to random graphs

    Svante Janson and Krzysztof Nowicki. The asymptotic distributions of generalized U -statistics with applications to random graphs. Probability theory and related fields , 90(3):341--375, 1991

  23. [23]

    Theory of U -statistics

    VS Koroljuk and Yu V Borovskich. Theory of U -statistics . Springer, 1994

  24. [24]

    K-sample analogues of the K olmogorov-- S mirnov and C ram \'e r--v

    J Kiefer. K-sample analogues of the K olmogorov-- S mirnov and C ram \'e r--v. M ises tests. The Annals of Mathematical Statistics , pages 420--447, 1959

  25. [25]

    U-statistics: Theory and Practice

    Justin Lee. U-statistics: Theory and Practice . Citeseer, 1990

  26. [26]

    Consistency and unbiasedness of certain nonparametric tests

    Eric L Lehmann. Consistency and unbiasedness of certain nonparametric tests. The annals of mathematical statistics , pages 165--179, 1951

  27. [27]

    Wiener's random function, and other L aplacian random functions

    Paul L \'e vy. Wiener's random function, and other L aplacian random functions. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability . The Regents of the University of California, 1951

  28. [28]

    Learning from the past, predicting the statistics for the future, learning an evolving system

    Daniel Levin, Terry Lyons, and Hao Ni. Learning from the past, predicting the statistics for the future, learning an evolving system. arXiv preprint arXiv:1309.0260 , 2013

  29. [29]

    Differential equations driven by rough signals

    Terry J Lyons. Differential equations driven by rough signals. Revista Matem \'a tica Iberoamericana , 14(2):215--310, 1998

  30. [30]

    On the distribution of the area enclosed by a random walk on Z^2

    James A Mingo and Alexandru Nica. On the distribution of the area enclosed by a random walk on Z^2 . Journal of Combinatorial Theory, Series A , 84(1):55--86, 1998

  31. [31]

    On a test of whether one of two random variables is stochastically larger than the other

    Henry B Mann and Donald R Whitney. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics , pages 50--60, 1947

  32. [32]

    Karl Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science , 50(302):157--175, 1900

  33. [33]

    A new way to obtain W atson's U^2

    Tore Persson. A new way to obtain W atson's U^2 . Scandinavian Journal of Statistics , pages 119--122, 1979

  34. [34]

    Polymath

    D. Polymath. Polymath 13 -- a success! \ ://polymathprojects.org/2017/08/22/polymath- 13-a-success & Gowers's Weblog , 2017

  35. [35]

    Some distribution-free k-sample rank tests of homogeneity against ordered alternatives

    Madan L Puri. Some distribution-free k-sample rank tests of homogeneity against ordered alternatives. Communications on Pure and Applied Mathematics , 1965

  36. [36]

    Spectra of symmetrized shuffling operators, Vol

    Victor Reiner, Franco Saliola, and Volkmar Welker. Spectra of symmetrized shuffling operators, Vol. 228, No. 1072 . American Mathematical Society, 2014

  37. [37]

    The symmetric group: representations, combinatorial algorithms, and symmetric functions , volume 203

    Bruce E Sagan. The symmetric group: representations, combinatorial algorithms, and symmetric functions , volume 203. Springer Science & Business Media, 2013

  38. [38]

    Approximation theorems of mathematical statistics , volume 162

    Robert J Serfling. Approximation theorems of mathematical statistics , volume 162. John Wiley & Sons, 1980

  39. [39]

    Random walk, semi-direct products, and card shuffling

    Jay-Calvin Uyemura Reyes . Random walk, semi-direct products, and card shuffling. Ph.D. thesis, Stanford University. , 2002

  40. [40]

    Goodness-of-fit tests on a circle

    George S Watson. Goodness-of-fit tests on a circle. II . Biometrika , 49(1/2):57--63, 1962

  41. [41]

    Individual comparisons by ranking methods

    Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics , 1(6):80--83, 1945

  42. [42]

    Doron Gepner's Statistics on Words in {1,2,3} is (most probably) Asymptotically Logistic

    Doron Zeilberger. Doron G epner's statistics on words in \ 1, 2, 3 \ is (most probably) asymptotically logistic. arXiv preprint arXiv:1604.00663 , 2016