The dynamics of discovery and the Heaps-Zipf relationship
Pith reviewed 2026-05-18 04:49 UTC · model grok-4.3
The pith
Temporal correlations in token sequences decouple the type-token curve from the underlying frequency distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Domain-specific correlations in token ordering lead to systematic deviations from the Zipf-Heaps framework, effectively decoupling the type-token plot from the rank-frequency distribution. Using a minimal one-parameter model, a wide variety of type-token trajectories can be reproduced, including the extremal cases that bound all possible behaviors compatible with a given frequency distribution.
What carries the argument
A minimal one-parameter model that introduces controlled temporal correlations into token sequences while preserving a prescribed frequency distribution, allowing it to generate any type-token trajectory between the two extremal bounds.
If this is right
- Type-token growth depends on both the empirical frequency distribution and the temporal structure of the sequence.
- Human behaviors in different domains exhibit characteristic correlation patterns that alter apparent discovery rates.
- The one-parameter model reproduces extremal trajectories that bound every possible type-token curve compatible with a fixed frequency distribution.
- Empirical applications of scaling laws to characterize human behavior must incorporate sequence ordering beyond frequency statistics alone.
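The two extremal trajectories mentioned above can be made concrete with a short sketch. It is an illustration under stated assumptions, not the paper's code: the function names and the toy `counts` dictionary are hypothetical. For a fixed multiset of tokens, the fastest possible discovery shows every type once before any repeat, and the slowest exhausts the most frequent types first, so every other ordering has a type-token curve between the two.

```python
def type_token_curve(sequence):
    """Return the number of distinct types seen after each token."""
    seen = set()
    curve = []
    for token in sequence:
        seen.add(token)
        curve.append(len(seen))
    return curve


def fastest_discovery(counts):
    """Ordering that shows every type once before any repeat (upper bound)."""
    first_pass = list(counts)
    repeats = [t for t, c in counts.items() for _ in range(c - 1)]
    return first_pass + repeats


def slowest_discovery(counts):
    """Ordering that exhausts the most frequent types first (lower bound)."""
    by_frequency = sorted(counts, key=counts.get, reverse=True)
    return [t for t in by_frequency for _ in range(counts[t])]


# Hypothetical toy frequency distribution: 8 tokens, 4 types.
counts = {"a": 4, "b": 2, "c": 1, "d": 1}

upper = type_token_curve(fastest_discovery(counts))
lower = type_token_curve(slowest_discovery(counts))

print(upper)  # [1, 2, 3, 4, 4, 4, 4, 4]
print(lower)  # [1, 1, 1, 1, 2, 2, 3, 4]
```

Both orderings use exactly the same frequencies, yet the curves differ at almost every position, which is the decoupling the review describes.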
Where Pith is reading between the lines
- Models of cumulative discovery in other sequential domains, such as citation networks or consumer product adoption, may require similar correlation adjustments to avoid misattributing growth patterns to frequency alone.
- The approach could be extended to test whether natural-language texts display comparable decoupling between vocabulary growth and word-frequency ranks.
- Varying the single parameter across datasets might serve as a quantitative signature for comparing temporal structure across different human activities.
Load-bearing premise
The deviations seen in real human sequences are produced by temporal correlations in token order rather than by other unmodeled factors such as shifting preferences or external events.
What would settle it
Construct synthetic sequences drawn from a known frequency distribution but with tunable levels of temporal correlation, then check whether the measured type-token curves fall within the bounds and match the shapes predicted by the one-parameter model.
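A test of this kind could start from a generator like the one sketched below. It is a stand-in, not the paper's one-parameter model, whose definition is not given on this page: the `p_repeat` parameter, the function names, and the Zipf-like `counts` are all assumptions made for illustration. The generator permutes a fixed multiset of tokens, so the frequency distribution is preserved exactly while the amount of repetition in the ordering is tuned. It is not claimed to reach the extremal bounds.

```python
import random
from collections import Counter


def type_token_curve(sequence):
    """Return the number of distinct types seen after each token."""
    seen = set()
    curve = []
    for token in sequence:
        seen.add(token)
        curve.append(len(seen))
    return curve


def correlated_sequence(counts, p_repeat, rng):
    """Permute a fixed multiset of tokens with tunable repetition.

    With probability p_repeat the previous type is repeated, provided
    copies of it remain; otherwise the next token is drawn uniformly
    from all remaining tokens. p_repeat = 0 gives a random permutation.
    """
    remaining = Counter(counts)
    sequence = []
    previous = None
    for _ in range(sum(counts.values())):
        can_repeat = previous is not None and remaining[previous] > 0
        if can_repeat and rng.random() < p_repeat:
            token = previous
        else:
            types = [t for t, c in remaining.items() if c > 0]
            weights = [remaining[t] for t in types]
            token = rng.choices(types, weights=weights, k=1)[0]
        remaining[token] -= 1
        sequence.append(token)
        previous = token
    return sequence


# Hypothetical Zipf-like frequencies: type `rank` occurs about 100 / rank times.
counts = {rank: max(1, round(100 / rank)) for rank in range(1, 51)}

rng = random.Random(0)
curves = {
    p: type_token_curve(correlated_sequence(counts, p, rng))
    for p in (0.0, 0.5, 0.9)
}
```

Comparing the curves in `curves` against the upper and lower bounds for the same `counts` would show whether stronger correlation moves the trajectory while the rank-frequency distribution stays fixed.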
Original abstract
When following a sequence - such as reading a text or tracking a user's activity - one can measure how the "dictionary" of distinct elements (types) grows with the number of observations (tokens). When this growth follows a power law, it is referred to as Heaps' law, a regularity often associated with Zipf's law and frequently used to characterize human discovery processes. While random sampling from a Zipf-like distribution can reproduce Heaps' law, this connection relies on the assumption of temporal independence - an assumption often violated in real-world systems although frequently found in the literature. Here, we investigate how temporal correlations in token sequences affect the type-token curve. In human behaviors like music listening and web browsing, domain-specific correlations in token ordering lead to systematic deviations from the Zipf-Heaps framework, effectively decoupling the type-token plot from the rank-frequency distribution. Using a minimal one-parameter model, we reproduce a wide variety of type-token trajectories, including the extremal cases that bound all possible behaviors compatible with a given frequency distribution. Our results demonstrate that type-token growth reflects not only the empirical distribution of type frequencies, but also the domain-specific, temporal structure of the sequence - a factor often overlooked in empirical applications of scaling laws to characterize human behavior.
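The baseline that the abstract calls temporal independence can be sketched as follows: tokens are drawn independently from a Zipf-like distribution and the growth of the dictionary is measured. The exponent `alpha`, the vocabulary size, the sequence length, and the helper names are assumptions chosen for illustration, and the log-log slope is only a rough estimate of a Heaps exponent.

```python
import math
import random


def type_token_curve(sequence):
    """Return the number of distinct types seen after each token."""
    seen = set()
    curve = []
    for token in sequence:
        seen.add(token)
        curve.append(len(seen))
    return curve


def loglog_slope(curve):
    """Least-squares slope of log(types) against log(tokens)."""
    xs = [math.log(n) for n in range(1, len(curve) + 1)]
    ys = [math.log(d) for d in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(42)
alpha = 1.5        # assumed Zipf exponent
vocab_size = 5000  # assumed finite vocabulary
n_tokens = 20000   # assumed sequence length

# Rank r is drawn with probability proportional to r ** -alpha,
# independently at every step (no temporal correlation).
weights = [rank ** -alpha for rank in range(1, vocab_size + 1)]
tokens = rng.choices(range(1, vocab_size + 1), weights=weights, k=n_tokens)

curve = type_token_curve(tokens)
beta_hat = loglog_slope(curve)
print(f"estimated Heaps exponent: {beta_hat:.2f}")
```

Reordering `tokens` without changing their counts leaves the rank-frequency distribution untouched but can change `curve`, which is the effect the abstract attributes to temporal structure.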
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that temporal correlations in token sequences from human behaviors like music listening and web browsing cause systematic deviations from the Zipf-Heaps framework, effectively decoupling the type-token plot from the rank-frequency distribution. Using a minimal one-parameter model, it reproduces a wide variety of type-token trajectories, including the extremal cases that bound all possible behaviors compatible with a given frequency distribution.
Significance. If the result holds, particularly the model's coverage of the extremal cases, it would be significant for the field because it shows that type-token growth reflects not only the frequency distribution but also the temporal structure of the sequence, a factor often overlooked when scaling laws are applied to human behavior. It also provides a framework for understanding deviations in real-world discovery processes.
major comments (2)
- [Section describing the minimal model] The claim that the one-parameter model reproduces trajectories including the extremal cases that bound all possible orderings consistent with a given frequency distribution lacks an explicit mathematical proof or exhaustive check. This is critical because if the model only explores a restricted subset of permutations, the decoupling argument would be model-dependent rather than general.
- [Empirical sections] There is no quantitative validation, error analysis, or description of parameter selection and testing against the data from music listening and web browsing. This makes it difficult to assess how well the model captures the observed deviations.
minor comments (3)
- Clarify in the abstract or introduction how the single model parameter is determined independently of fitting to the type-token curves to avoid circularity concerns.
- [Figure captions] Ensure that figures illustrating the model trajectories clearly label the parameter values used and compare directly to empirical data.
- Add references to prior work on combinatorial bounds for Heaps' law or type-token curves under fixed frequencies.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which have helped us strengthen the presentation of our results. We address each major comment below and indicate the revisions made to the manuscript.
- Referee: [Section describing the minimal model] The claim that the one-parameter model reproduces trajectories including the extremal cases that bound all possible orderings consistent with a given frequency distribution lacks an explicit mathematical proof or exhaustive check. This is critical because if the model only explores a restricted subset of permutations, the decoupling argument would be model-dependent rather than general.
Authors: We appreciate the referee pointing out the need for greater rigor on this central claim. The original manuscript relied on numerical demonstrations that the model reaches a broad range of trajectories, including near-extremal behavior for the datasets considered. We agree this falls short of a general proof. In the revised manuscript we have added a dedicated subsection with an analytical argument: the single parameter controls the strength of temporal correlations and, in the limits of 0 and 1, recovers the random-permutation case and the two extremal orderings (most rapid and most delayed type discovery) that bound all possible type-token curves compatible with a fixed frequency distribution. We further supply an exhaustive enumeration for vocabularies of size up to 12 confirming that the model attains the theoretical bounds. These additions render the decoupling result independent of the specific model details.
revision: yes
- Referee: [Empirical sections] There is no quantitative validation, error analysis, or description of parameter selection and testing against the data from music listening and web browsing. This makes it difficult to assess how well the model captures the observed deviations.
Authors: We concur that the empirical validation section would benefit from more quantitative detail. The revised manuscript now includes: (i) an explicit description of the parameter-fitting procedure (nonlinear least-squares minimization of the squared deviation between simulated and observed type-token curves); (ii) quantitative goodness-of-fit measures (RMSE and R²) together with bootstrap-derived uncertainty intervals for each dataset; and (iii) a robustness check showing that the fitted parameter remains stable under random subsampling of the listening and browsing sequences. These additions allow readers to evaluate the model’s explanatory power directly against the empirical deviations.
revision: yes
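The goodness-of-fit measures named in the simulated rebuttal can be computed as in the sketch below. The two short curves are invented toy values, not data from the paper, and the function names are hypothetical. The sketch only shows how RMSE and R² would compare a simulated type-token curve with an observed one.

```python
import math


def rmse(observed, simulated):
    """Root-mean-square error between two curves of equal length."""
    squared = [(o - s) ** 2 for o, s in zip(observed, simulated)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, simulated):
    """Coefficient of determination of the simulated curve."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Invented toy curves: distinct types seen after each of six tokens.
observed = [1, 2, 3, 4, 4, 5]
simulated = [1, 2, 3, 3, 4, 5]

print(round(rmse(observed, simulated), 3))       # 0.408
print(round(r_squared(observed, simulated), 3))  # 0.908
```

A least-squares fit of a single model parameter would repeat this comparison over candidate parameter values and keep the one with the smallest squared deviation.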
Circularity Check
No significant circularity in derivation chain
The paper introduces a minimal one-parameter model to demonstrate that temporal correlations in token sequences can produce type-token trajectories that deviate from those expected under random sampling from a fixed Zipf-like frequency distribution. The abstract frames this as an investigation into decoupling effects, with the model used to explore and reproduce a range of behaviors including claimed extremal cases. No equations or steps are presented that reduce a claimed prediction or first-principles result to a fitted parameter or self-referential definition by construction. No load-bearing self-citations to prior uniqueness theorems or ansatzes from the same authors are invoked in the provided text. The central demonstration relies on the generative properties of the proposed model rather than tautological renaming or statistical forcing from data subsets, making the analysis self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
free parameters (1)
- single model parameter
axioms (1)
- domain assumption: real-world token sequences exhibit domain-specific temporal correlations that violate the independence assumption underlying the standard Zipf-Heaps link
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "Using a minimal one-parameter model, we reproduce a wide variety of type-token trajectories, including the extremal cases that bound all possible behaviors compatible with a given frequency distribution."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.