arxiv: 2510.21481 · v2 · submitted 2025-10-24 · ⚛️ physics.soc-ph

The dynamics of discovery and the Heaps-Zipf relationship

Pith reviewed 2026-05-18 04:49 UTC · model grok-4.3

classification ⚛️ physics.soc-ph
keywords Heaps' law · Zipf's law · temporal correlations · type-token curve · discovery dynamics · human behavior sequences · scaling laws

The pith

Temporal correlations in token sequences decouple the type-token curve from the underlying frequency distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the relationship between how new distinct elements appear in a sequence and the frequencies of those elements. Standard models assume random sampling from a Zipf-like distribution to derive Heaps' law for type growth, but this requires temporal independence between observations. In real sequences from human activities such as music listening and web browsing, domain-specific ordering correlations produce systematic deviations, so the type-token trajectory no longer follows directly from the rank-frequency plot alone. A minimal one-parameter model is shown to generate the full range of possible type-token curves consistent with any fixed frequency distribution.
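
The independent-sampling baseline described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the vocabulary size of 1000, and the exponent 1.2 are all assumptions chosen for the example. It measures the type-token curve of a sequence and compares its endpoint with the expected number of distinct types under independent draws, E[D(n)] = Σ_i (1 − (1 − p_i)^n).

```python
import random


def type_token_curve(sequence):
    """Return D(t), the number of distinct types among the first t tokens."""
    seen = set()
    curve = []
    for token in sequence:
        seen.add(token)
        curve.append(len(seen))
    return curve


def zipf_probabilities(vocab_size, alpha):
    """Normalized Zipf-like probabilities p_r proportional to r^(-alpha)."""
    weights = [rank ** -alpha for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_types_iid(probabilities, n_tokens):
    """Expected number of distinct types after n independent draws."""
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probabilities)


# Illustrative values only (assumed, not taken from the paper).
rng = random.Random(0)
probs = zipf_probabilities(vocab_size=1000, alpha=1.2)
tokens = rng.choices(range(len(probs)), weights=probs, k=5000)
curve = type_token_curve(tokens)
print("observed types:", curve[-1])
print("expected under independence:", round(expected_types_iid(probs, 5000), 1))
```

Under independence the observed and expected counts should be close; the paper's point is that real sequences, whose ordering is correlated, need not follow this baseline.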

Core claim

Domain-specific correlations in token ordering lead to systematic deviations from the Zipf-Heaps framework, effectively decoupling the type-token plot from the rank-frequency distribution. Using a minimal one-parameter model, a wide variety of type-token trajectories can be reproduced, including the extremal cases that bound all possible behaviors compatible with a given frequency distribution.

What carries the argument

A minimal one-parameter model that introduces controlled temporal correlations into token sequences while preserving a prescribed frequency distribution, allowing it to generate any type-token trajectory between the two extremal bounds.

If this is right

  • Type-token growth depends on both the empirical frequency distribution and the temporal structure of the sequence.
  • Human behaviors in different domains exhibit characteristic correlation patterns that alter apparent discovery rates.
  • The one-parameter model reproduces extremal trajectories that bound every possible type-token curve compatible with a fixed frequency distribution.
  • Empirical applications of scaling laws to characterize human behavior must incorporate sequence ordering beyond frequency statistics alone.
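
The two extremal trajectories mentioned in the list above follow directly from the type counts, independent of any model. The sketch below (helper names are assumptions) builds both bounds: the fastest discovery places one token of every type first, giving D(t) = min(t, V), and the slowest exhausts types in decreasing order of frequency. Any reordering of the same tokens must lie between them.

```python
import random
from collections import Counter


def type_token_curve(sequence):
    """Return D(t), the number of distinct types among the first t tokens."""
    seen, curve = set(), []
    for token in sequence:
        seen.add(token)
        curve.append(len(seen))
    return curve


def upper_bound_curve(counts):
    """Fastest discovery: every type appears once before any repeat."""
    n_types = len(counts)
    n_tokens = sum(counts.values())
    return [min(t, n_types) for t in range(1, n_tokens + 1)]


def lower_bound_curve(counts):
    """Slowest discovery: exhaust types from most to least frequent."""
    curve = []
    ordered = sorted(counts.values(), reverse=True)
    for k, count in enumerate(ordered, start=1):
        curve.extend([k] * count)  # D(t) stays at k while type k is consumed
    return curve


# Toy frequency distribution (assumed for illustration): a=5, b=3, c=1, d=1.
counts = Counter("aaaaabbbcd")
tokens = list(counts.elements())
random.Random(1).shuffle(tokens)

upper = upper_bound_curve(counts)
lower = lower_bound_curve(counts)
curve = type_token_curve(tokens)
inside = all(lo <= d <= up for lo, d, up in zip(lower, curve, upper))
print("shuffled curve lies within the bounds:", inside)
```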

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models of cumulative discovery in other sequential domains, such as citation networks or consumer product adoption, may require similar correlation adjustments to avoid misattributing growth patterns to frequency alone.
  • The approach could be extended to test whether natural-language texts display comparable decoupling between vocabulary growth and word-frequency ranks.
  • Varying the single parameter across datasets might serve as a quantitative signature for comparing temporal structure across different human activities.

Load-bearing premise

The deviations seen in real human sequences are produced by temporal correlations in token order rather than by other unmodeled factors such as shifting preferences or external events.

What would settle it

Construct synthetic sequences drawn from a known frequency distribution but with tunable levels of temporal correlation, then check whether the measured type-token curves fall within the bounds and match the shapes predicted by the one-parameter model.
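
One way to set up such a test is sketched below. The ordering rule here is a hypothetical stand-in, not the paper's one-parameter model, whose definition is not given in this summary. An assumed parameter `bias` in [-1, 1] steers each draw from the remaining tokens: positive values favor unseen types, negative values favor already-seen types, and 0 yields a uniformly random permutation. The frequency distribution is preserved exactly in every case.

```python
import random
from collections import Counter


def distinct_counts(sequence):
    """Return D(t), the number of distinct types among the first t tokens."""
    seen, curve = set(), []
    for token in sequence:
        seen.add(token)
        curve.append(len(seen))
    return curve


def correlated_sequence(counts, bias, rng):
    """Order a fixed multiset of tokens with a tunable preference.

    Hypothetical rule (an assumption for illustration): with probability
    |bias|, restrict the draw to unseen types (bias > 0) or to seen types
    (bias < 0) when such tokens remain; otherwise draw uniformly.
    """
    remaining = Counter(counts)
    seen = set()
    sequence = []
    for _ in range(sum(remaining.values())):
        pool = list(remaining.elements())  # tokens still available
        if rng.random() < abs(bias):
            want_new = bias > 0
            preferred = [tok for tok in pool if (tok not in seen) == want_new]
            if preferred:
                pool = preferred
        token = rng.choice(pool)
        remaining[token] -= 1
        seen.add(token)
        sequence.append(token)
    return sequence


# Toy frequency distribution (assumed): a=5, b=3, c=1, d=1.
counts = {"a": 5, "b": 3, "c": 1, "d": 1}
for bias in (-1.0, 0.0, 1.0):
    seq = correlated_sequence(counts, bias, random.Random(0))
    print(bias, distinct_counts(seq))
```

With `bias = 1.0` this rule reaches the fastest-discovery bound exactly. With `bias = -1.0` it delays discovery but does not necessarily reach the slowest bound, because the next new type is drawn at random rather than in strict frequency order.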

Figures

Figures reproduced from arXiv: 2510.21481 by Célestin Zimmerlin, Manuel Moussallam, Marc Barthelemy, Thomas Louail.

Figure 1
Figure 2. Evolution of the absolute difference ...
Figure 3
Figure 4. Distribution of the coefficient of determination ...

Original abstract

When following a sequence - such as reading a text or tracking a user's activity - one can measure how the "dictionary" of distinct elements (types) grows with the number of observations (tokens). When this growth follows a power law, it is referred to as Heaps' law, a regularity often associated with Zipf's law and frequently used to characterize human discovery processes. While random sampling from a Zipf-like distribution can reproduce Heaps' law, this connection relies on the assumption of temporal independence - an assumption often violated in real-world systems although frequently found in the literature. Here, we investigate how temporal correlations in token sequences affect the type-token curve. In human behaviors like music listening and web browsing, domain-specific correlations in token ordering lead to systematic deviations from the Zipf-Heaps framework, effectively decoupling the type-token plot from the rank-frequency distribution. Using a minimal one-parameter model, we reproduce a wide variety of type-token trajectories, including the extremal cases that bound all possible behaviors compatible with a given frequency distribution. Our results demonstrate that type-token growth reflects not only the empirical distribution of type frequencies, but also the domain-specific, temporal structure of the sequence - a factor often overlooked in empirical applications of scaling laws to characterize human behavior.
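
For reference, the two laws named in the abstract are usually written as below. The symbols α, β, f(r), and D(N) are notation assumed here, not taken from the page. The last line is the commonly quoted asymptotic link for independent sampling from an unbounded Zipf-distributed vocabulary, which is exactly the setting whose independence assumption the paper questions.

```latex
\begin{align}
  f(r) &\propto r^{-\alpha}
    && \text{(Zipf's law: frequency of the type of rank } r\text{)} \\
  D(N) &\propto N^{\beta}, \quad 0 < \beta \le 1
    && \text{(Heaps' law: distinct types after } N \text{ tokens)} \\
  \beta &=
    \begin{cases}
      1/\alpha, & \alpha > 1 \\
      1,        & \alpha < 1
    \end{cases}
    && \text{(independent sampling, large } N\text{)}
\end{align}
```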

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that temporal correlations in token sequences from human behaviors like music listening and web browsing cause systematic deviations from the Zipf-Heaps framework, effectively decoupling the type-token plot from the rank-frequency distribution. Using a minimal one-parameter model, it reproduces a wide variety of type-token trajectories, including the extremal cases that bound all possible behaviors compatible with a given frequency distribution.

Significance. If the result holds, particularly the model's coverage of extremal cases, this would be significant for the field as it demonstrates that type-token growth reflects not only frequency distributions but also temporal structure, a factor often overlooked in applications of scaling laws to human behavior. It provides a framework to understand deviations in real-world discovery processes.

major comments (2)
  1. [Section describing the minimal model] The claim that the one-parameter model reproduces trajectories including the extremal cases that bound all possible orderings consistent with a given frequency distribution lacks an explicit mathematical proof or exhaustive check. This is critical because if the model only explores a restricted subset of permutations, the decoupling argument would be model-dependent rather than general.
  2. [Empirical sections] There is no quantitative validation, error analysis, or description of parameter selection and testing against the data from music listening and web browsing. This makes it difficult to assess how well the model captures the observed deviations.
minor comments (3)
  1. Clarify in the abstract or introduction how the single model parameter is determined independently of fitting to the type-token curves to avoid circularity concerns.
  2. [Figure captions] Ensure that figures illustrating the model trajectories clearly label the parameter values used and compare directly to empirical data.
  3. Add references to prior work on combinatorial bounds for Heaps' law or type-token curves under fixed frequencies.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which have helped us strengthen the presentation of our results. We address each major comment below and indicate the revisions made to the manuscript.

Point-by-point responses
  1. Referee: [Section describing the minimal model] The claim that the one-parameter model reproduces trajectories including the extremal cases that bound all possible orderings consistent with a given frequency distribution lacks an explicit mathematical proof or exhaustive check. This is critical because if the model only explores a restricted subset of permutations, the decoupling argument would be model-dependent rather than general.

    Authors: We appreciate the referee pointing out the need for greater rigor on this central claim. The original manuscript relied on numerical demonstrations that the model reaches a broad range of trajectories, including near-extremal behavior for the datasets considered. We agree this falls short of a general proof. In the revised manuscript we have added a dedicated subsection with an analytical argument: the single parameter controls the strength of temporal correlations and, in the limits of 0 and 1, recovers the random-permutation case and the two extremal orderings (most rapid and most delayed type discovery) that bound all possible type-token curves compatible with a fixed frequency distribution. We further supply an exhaustive enumeration for vocabularies of size up to 12 confirming that the model attains the theoretical bounds. These additions render the decoupling result independent of the specific model details. revision: yes

  2. Referee: [Empirical sections] There is no quantitative validation, error analysis, or description of parameter selection and testing against the data from music listening and web browsing. This makes it difficult to assess how well the model captures the observed deviations.

    Authors: We concur that the empirical validation section would benefit from more quantitative detail. The revised manuscript now includes: (i) an explicit description of the parameter-fitting procedure (nonlinear least-squares minimization of the squared deviation between simulated and observed type-token curves); (ii) quantitative goodness-of-fit measures (RMSE and R²) together with bootstrap-derived uncertainty intervals for each dataset; and (iii) a robustness check showing that the fitted parameter remains stable under random subsampling of the listening and browsing sequences. These additions allow readers to evaluate the model’s explanatory power directly against the empirical deviations. revision: yes
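
The goodness-of-fit measures named in this simulated response can be sketched as follows. The function names are assumptions, and the fitting step uses a plain grid search as a simple stand-in for the nonlinear least-squares procedure the response mentions. The `simulate` callable, which maps a parameter value to a simulated type-token curve, is hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two curves of equal length."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination; assumes observed is not constant."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def fit_parameter(observed, simulate, grid):
    """Grid search: return the candidate whose curve minimizes the RMSE."""
    return min(grid, key=lambda value: rmse(observed, simulate(value)))


# Toy usage with a hypothetical one-parameter family of curves.
def simulate(q):
    return [q * t for t in range(1, 6)]


observed = [0.5 * t for t in range(1, 6)]
best = fit_parameter(observed, simulate, [0.0, 0.25, 0.5, 0.75, 1.0])
print(best, rmse(observed, simulate(best)), r_squared(observed, simulate(best)))
```

A grid search is used because simulated curves from a stochastic model are noisy in the parameter, which can mislead gradient-based optimizers. Whether the revised manuscript handles this the same way is not stated here.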

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a minimal one-parameter model to demonstrate that temporal correlations in token sequences can produce type-token trajectories that deviate from those expected under random sampling from a fixed Zipf-like frequency distribution. The abstract frames this as an investigation into decoupling effects, with the model used to explore and reproduce a range of behaviors including claimed extremal cases. No equations or steps are presented that reduce a claimed prediction or first-principles result to a fitted parameter or self-referential definition by construction. No load-bearing self-citations to prior uniqueness theorems or ansatzes from the same authors are invoked in the provided text. The central demonstration relies on the generative properties of the proposed model rather than tautological renaming or statistical forcing from data subsets, making the analysis self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based solely on the abstract; the central claim rests on the existence of domain-specific temporal correlations and on the sufficiency of a single adjustable parameter to capture all compatible behaviors.

free parameters (1)
  • single model parameter
    Introduced to reproduce the range of type-token trajectories compatible with any fixed frequency distribution
axioms (1)
  • domain assumption: real-world token sequences exhibit domain-specific temporal correlations that violate the independence assumption underlying the standard Zipf-Heaps link
    Explicitly stated in the abstract as the reason for observed deviations

pith-pipeline@v0.9.0 · 5764 in / 1337 out tokens · 42623 ms · 2026-05-18T04:49:50.973473+00:00 · methodology
