Spectral Analysis of Word Statistics
Pith reviewed 2026-05-24 13:12 UTC · model grok-4.3
The pith
Word subsequence counts in random texts over finite alphabets have their covariance matrices decomposed into explicit eigenvectors and eigenvalues by order.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish the spectral decomposition of the space of word statistics of each order. We provide explicit formulas for the eigenvectors and eigenvalues of the covariance matrix of the multivariate distribution of these statistics.
What carries the argument
Algebraic word operators that decompose the space of word statistics of each order into eigenspaces with explicit covariance eigenvectors and eigenvalues.
If this is right
- Special cases such as intransitive dice and random core partitions are described within the same unified framework.
- Several classical statistical tests receive a common structural description.
- The decomposition opens further applications to data analysis and machine learning.
Where Pith is reading between the lines
- The explicit spectrum may simplify higher-order moment calculations in large sequence datasets.
- Analogous decompositions could be sought for counting statistics in other discrete probability models.
- The approach may yield closed-form expressions for certain random walk functionals.
Load-bearing premise
The joint distribution of word subsequence counts in a random text over a finite alphabet admits an asymptotic structure that allows algebraic decomposition as text length grows.
What would settle it
Generate many random texts over a small alphabet, compute the sample covariance matrix of word subsequence counts for fixed lengths, and check whether its eigenvectors and eigenvalues match the paper's explicit formulas.
read the original abstract
Given a random text over a finite alphabet, we study the frequencies at which fixed-length words occur as subsequences. As the data size grows, the joint distribution of word counts exhibits a rich asymptotic structure. We investigate all linear combinations of subword statistics, and fully characterize their different orders of magnitude using diverse algebraic tools. Moreover, we establish the spectral decomposition of the space of word statistics of each order. We provide explicit formulas for the eigenvectors and eigenvalues of the covariance matrix of the multivariate distribution of these statistics. Our techniques include and elaborate on a set of algebraic word operators, recently studied and employed by Dieker and Saliola (Adv Math, 2018). Subword counts find applications in Combinatorics, Statistics, and Computer Science. We revisit special cases from the combinatorial literature, such as intransitive dice, random core partitions, and questions on random walk. Our structural approach describes in a unified framework several classical statistical tests. We propose further potential applications to data analysis and machine learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the asymptotic joint distribution of subsequence counts for fixed-length words in random texts over a finite alphabet. It claims to fully characterize the different orders of magnitude of all linear combinations of these subword statistics via algebraic tools, and to establish the spectral decomposition of the space of word statistics of each order, including explicit formulas for the eigenvectors and eigenvalues of the covariance matrix of the multivariate distribution. The approach elaborates on algebraic word operators from Dieker and Saliola (2018) and applies the framework to classical problems such as intransitive dice, random core partitions, and random walks, while proposing uses in data analysis and machine learning.
Significance. If the explicit spectral formulas and decomposition hold, the work supplies a unified algebraic structure for analyzing dependencies among subsequence statistics, which could streamline several classical statistical tests and combinatorial enumerations. The provision of closed-form eigenvectors/eigenvalues for the limiting covariance at each order is a concrete advance over purely asymptotic or simulation-based approaches in the area.
minor comments (4)
- §2: The definition of the word operators and their action on the space of statistics would benefit from an explicit small-alphabet example (e.g., binary alphabet, length-2 words) to illustrate the claimed commutation relations before the general spectral theorem is stated.
- §4.2, Theorem 4.3: The statement that the eigenvectors are 'parameter-free' should be qualified by noting the dependence on the underlying letter probabilities; the current wording risks suggesting independence from the text model.
- The applications to intransitive dice and random core partitions are sketched but lack a self-contained verification that the general spectral formulas recover the known variances or correlations in those special cases; adding one short calculation per example would strengthen the unified-framework claim.
- Notation: The distinction between the finite-n covariance and its limiting form is occasionally blurred in the text; consistent use of subscripts (e.g., Σ_n vs. Σ) would improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the manuscript, the accurate summary of its contributions, and the recommendation for minor revision. No major comments appear in the report.
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper derives explicit eigenvector and eigenvalue formulas for the covariance of word-subsequence statistics by elaborating algebraic operators from the external reference Dieker and Saliola (Adv. Math. 2018). No step reduces a claimed result to a fitted parameter, self-definition, or self-citation chain; the central spectral decomposition is obtained via algebraic manipulation of the cited operators applied to the subsequence-count model, which is independent of the target formulas. The analysis is therefore not circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Random text generated over a finite alphabet yields subsequence word counts with well-defined asymptotic joint distributions
Reference graph
Works this paper leans on
-
[1]
On the distribution of the two-sample C ram \'e r-von M ises criterion
Theodore W Anderson. On the distribution of the two-sample C ram \'e r-von M ises criterion. The Annals of Mathematical Statistics , pages 1148--1159, 1962
work page 1962
-
[2]
Partitions which are simultaneously t_1 -and t_2 -core
Jaclyn Anderson. Partitions which are simultaneously t_1 -and t_2 -core. Discrete Mathematics , 248(1-3):237--243, 2002
work page 2002
-
[3]
Brian Conrey, James Gabbard, Katie Grant, Andrew Liu, and Kent E Morrison. Intransitive dice. Mathematics Magazine , 89(2):133--143, 2016
work page 2016
-
[4]
Integration of paths -- a faithful representation of paths by noncommutative formal power series
Kuo-Tsai Chen. Integration of paths -- a faithful representation of paths by noncommutative formal power series. Transactions of the American Mathematical Society , 89(2):395--407, 1958
work page 1958
-
[5]
A primer on the signature method in machine learning
Ilya Chevyrev and Andrey Kormilitzin. A primer on the signature method in machine learning. arXiv preprint arXiv:1603.03788 , 2016
-
[6]
Group representations in probability and statistics
Persi Diaconis. Group representations in probability and statistics. Lecture notes -- monograph series , 11:i--192, 1988
work page 1988
-
[7]
Spectral analysis of random-to-random markov chains
Anton B Dieker and Franco V Saliola. Spectral analysis of random-to-random markov chains. Advances in Mathematics , 323:427--485, 2018
work page 2018
-
[8]
Orthogonal expansions and U -statistics
GK Eagleson. Orthogonal expansions and U -statistics. Australian Journal of Statistics , 21(3):221--237, 1979
work page 1979
-
[9]
Invariants of random knots and links
Chaim Even -Zohar , Joel Hass, Nati Linial, and Tahl Nowik. Invariants of random knots and links. Discrete & Computational Geometry , 56(2):274--314, 2016
work page 2016
-
[10]
Patterns in random permutations
Chaim Even -Zohar . Patterns in random permutations. Combinatorica , pages 1--30, 2020
work page 2020
-
[11]
Sizes of simultaneous core partitions
Chaim Even -Zohar . Sizes of simultaneous core partitions. arXiv preprint arXiv:2003.13671 , 2020
-
[12]
Shalosh B Ekhad and Doron Zeilberger. Explicit expressions for the variance and higher moments of the size of a simultaneous core partition and its limiting distribution. arXiv preprint arXiv:1508.07637 , 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[13]
Shalosh B Ekhad and Doron Zeilberger. A treatise on sucker's bets. arXiv preprint arXiv:1710.10344 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[14]
Representation theory: a first course , volume 129
William Fulton and Joe Harris. Representation theory: a first course , volume 129. Springer Science & Business Media, 2013
work page 2013
-
[15]
Martin Gardner. The colossal book of mathematics: classic puzzles, paradoxes, and problems: number theory, algebra, geometry, probability, topology, game theory, infinity, and other topics of recreational mathematics . WW Norton & Company, 2001
work page 2001
-
[16]
Marches al \'e atoires sur les groupes de L ie , volume 624
Yves Guivarc'h, Michael Keane, and Bernard Roynette. Marches al \'e atoires sur les groupes de L ie , volume 624. Springer, 1977
work page 1977
-
[17]
The probability of intransitivity in dice and close elections
Jan H a z a, Elchanan Mossel, Nathan Ross, and Guangqu Zheng. The probability of intransitivity in dice and close elections. Probability Theory and Related Fields , pages 1--59, 2020
work page 2020
-
[18]
A class of statistics with asymptotically normal distribution
Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics , 19(3):293--325, 1948
work page 1948
-
[19]
Gaussian H ilbert spaces , volume 129
Svante Janson. Gaussian H ilbert spaces , volume 129. Cambridge university press, 1997
work page 1997
-
[20]
Renewal theory for asymmetric U -statistics
Svante Janson. Renewal theory for asymmetric U -statistics. Electronic Journal of Probability , 23, 2018
work page 2018
-
[21]
Svante Janson, Tomasz Luczak, and Andrzej Rucinski. Random graphs , volume 45. John Wiley & Sons, 2011
work page 2011
-
[22]
The asymptotic distributions of generalized U -statistics with applications to random graphs
Svante Janson and Krzysztof Nowicki. The asymptotic distributions of generalized U -statistics with applications to random graphs. Probability theory and related fields , 90(3):341--375, 1991
work page 1991
-
[23]
VS Koroljuk and Yu V Borovskich. Theory of U -statistics . Springer, 1994
work page 1994
-
[24]
K-sample analogues of the K olmogorov-- S mirnov and C ram \'e r--v
J Kiefer. K-sample analogues of the K olmogorov-- S mirnov and C ram \'e r--v. M ises tests. The Annals of Mathematical Statistics , pages 420--447, 1959
work page 1959
-
[25]
U-statistics: Theory and Practice
Justin Lee. U-statistics: Theory and Practice . Citeseer, 1990
work page 1990
-
[26]
Consistency and unbiasedness of certain nonparametric tests
Eric L Lehmann. Consistency and unbiasedness of certain nonparametric tests. The annals of mathematical statistics , pages 165--179, 1951
work page 1951
-
[27]
Wiener's random function, and other L aplacian random functions
Paul L \'e vy. Wiener's random function, and other L aplacian random functions. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability . The Regents of the University of California, 1951
work page 1951
-
[28]
Learning from the past, predicting the statistics for the future, learning an evolving system
Daniel Levin, Terry Lyons, and Hao Ni. Learning from the past, predicting the statistics for the future, learning an evolving system. arXiv preprint arXiv:1309.0260 , 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[29]
Differential equations driven by rough signals
Terry J Lyons. Differential equations driven by rough signals. Revista Matem \'a tica Iberoamericana , 14(2):215--310, 1998
work page 1998
-
[30]
On the distribution of the area enclosed by a random walk on Z^2
James A Mingo and Alexandru Nica. On the distribution of the area enclosed by a random walk on Z^2 . Journal of Combinatorial Theory, Series A , 84(1):55--86, 1998
work page 1998
-
[31]
On a test of whether one of two random variables is stochastically larger than the other
Henry B Mann and Donald R Whitney. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics , pages 50--60, 1947
work page 1947
-
[32]
Karl Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science , 50(302):157--175, 1900
work page 1900
-
[33]
A new way to obtain W atson's U^2
Tore Persson. A new way to obtain W atson's U^2 . Scandinavian Journal of Statistics , pages 119--122, 1979
work page 1979
- [34]
-
[35]
Some distribution-free k-sample rank tests of homogeneity against ordered alternatives
Madan L Puri. Some distribution-free k-sample rank tests of homogeneity against ordered alternatives. Communications on Pure and Applied Mathematics , 1965
work page 1965
-
[36]
Spectra of symmetrized shuffling operators, Vol
Victor Reiner, Franco Saliola, and Volkmar Welker. Spectra of symmetrized shuffling operators, Vol. 228, No. 1072 . American Mathematical Society, 2014
work page 2014
-
[37]
The symmetric group: representations, combinatorial algorithms, and symmetric functions , volume 203
Bruce E Sagan. The symmetric group: representations, combinatorial algorithms, and symmetric functions , volume 203. Springer Science & Business Media, 2013
work page 2013
-
[38]
Approximation theorems of mathematical statistics , volume 162
Robert J Serfling. Approximation theorems of mathematical statistics , volume 162. John Wiley & Sons, 1980
work page 1980
-
[39]
Random walk, semi-direct products, and card shuffling
Jay-Calvin Uyemura Reyes . Random walk, semi-direct products, and card shuffling. Ph.D. thesis, Stanford University. , 2002
work page 2002
-
[40]
Goodness-of-fit tests on a circle
George S Watson. Goodness-of-fit tests on a circle. II . Biometrika , 49(1/2):57--63, 1962
work page 1962
-
[41]
Individual comparisons by ranking methods
Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics , 1(6):80--83, 1945
work page 1945
-
[42]
Doron Gepner's Statistics on Words in {1,2,3} is (most probably) Asymptotically Logistic
Doron Zeilberger. Doron G epner's statistics on words in \ 1, 2, 3 \ is (most probably) asymptotically logistic. arXiv preprint arXiv:1604.00663 , 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.