The Unseen Species Problem Revisited

Edward Eriksson

arXiv: 2602.08769 · v3 · submitted 2026-02-09 · math.ST · stat.TH


Pith reviewed 2026-05-16 05:27 UTC · model grok-4.3

keywords: unseen species · Good-Toulmin estimator · prediction intervals · species sampling · power-law tail · incidence data · Stein's method · concentration inequalities

The pith

For small m, the Good-Toulmin estimator is the only symmetry-respecting method that achieves a non-trivial rate when predicting how many new outcomes appear after n samples, and principled prediction intervals are now available for it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The unseen species problem asks how many new distinct items will appear in m further draws from an unknown discrete distribution after n i.i.d. observations. For small m, the paper shows that only the Good-Toulmin estimator both respects the natural symmetries of the sampling process and attains a non-trivial convergence rate, and it supplies principled prediction intervals for that estimator, which the abstract describes as resolving an open problem. For intermediate m, a new estimator is introduced that sharply lowers worst-case mean squared error relative to competing methods. For large m, under an assumed power-law tail, a simple estimator matches the rate of a recent, more sophisticated method, improves empirical accuracy, and comes with pre-asymptotic guarantees. The same rate results are extended to incidence data when each observed set has bounded size, using concentration inequalities derived via Stein's method.

Core claim

In the unseen species problem the Good-Toulmin estimator is the unique estimator that respects the symmetries of the sampling model and attains a non-trivial rate for small m; principled prediction intervals are constructed for it. A new estimator for intermediate m improves worst-case MSE, while for large m a simple estimator matches existing rates with better empirical performance and pre-asymptotic guarantees under power-law tails. The rate guarantees carry over to incidence data of bounded set size without further independence assumptions.

What carries the argument

The Good-Toulmin estimator, which predicts the number of new species with a linear combination of the observed frequency counts (how many species were seen exactly once, exactly twice, and so on) and is therefore invariant under relabeling of the outcomes.
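
As a rough illustration, the classical Good-Toulmin form writes the prediction as an alternating series in the frequency counts: with t = m/n and Φ_i the number of species seen exactly i times, the estimate is −Σ_{i≥1} (−t)^i Φ_i. The Python sketch below implements that textbook form. The function names are hypothetical, and the paper's exact normalization and any smoothing it applies are not shown on this page, so this is an assumption rather than the author's implementation.

```python
from collections import Counter


def frequency_counts(sample):
    """Map i -> number of distinct species seen exactly i times."""
    species_counts = Counter(sample)          # species -> times observed
    return Counter(species_counts.values())   # i -> Phi_i


def good_toulmin(sample, m):
    """Classical Good-Toulmin prediction of new species in m further draws.

    Assumed textbook form: -sum_i (-t)^i * Phi_i with t = m / n.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    t = m / n
    phi = frequency_counts(sample)
    return -sum((-t) ** i * phi_i for i, phi_i in phi.items())


if __name__ == "__main__":
    observed = ["a", "b", "b", "c", "c", "c"]
    print(good_toulmin(observed, m=3))
```

The estimate depends on the sample only through the frequency counts, so renaming species leaves it unchanged. The alternating series is generally considered well behaved only when m does not exceed n, which matches the small-m regime discussed here.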

If this is right

Only the Good-Toulmin estimator satisfies both symmetry and non-trivial rate for small m.
Principled prediction intervals can now be attached to Good-Toulmin estimates.
The new intermediate-m estimator reduces worst-case mean squared error compared with prior methods.
Under power-law tails a simple estimator matches the rate of a recent, more sophisticated method for large m while offering better empirical accuracy and pre-asymptotic guarantees.
The same rates hold for incidence data when every observed set has bounded cardinality.
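
For incidence data, each observation is a set of species rather than a single draw, so the natural frequency counts record in how many sets a species appears. The sketch below computes those counts and enforces a bound on set size. The function name and the `max_set_size` parameter are hypothetical, and how the paper feeds such counts into its estimators is not stated on this page.

```python
from collections import Counter


def incidence_frequency_counts(observations, max_set_size):
    """Map k -> number of species that occur in exactly k observed sets.

    `max_set_size` stands in for the bounded-set-size condition; it is an
    illustrative parameter, not a quantity taken from the paper.
    """
    sets_containing = Counter()  # species -> number of sets it appears in
    for observation in observations:
        items = set(observation)  # a species counts once per set
        if len(items) > max_set_size:
            raise ValueError("observation exceeds the assumed size bound")
        sets_containing.update(items)
    return Counter(sets_containing.values())


if __name__ == "__main__":
    data = [{"a", "b"}, {"b"}, {"b", "c"}]
    print(incidence_frequency_counts(data, max_set_size=2))
```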

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The symmetry argument may extend to other nonparametric prediction tasks that involve exchangeable observations of discrete types.
The Stein-derived concentration bounds could be reused for other functionals of sequences of set-valued random variables.
In practice the power-law assumption for large m can be checked against the observed frequency tail before applying the simple estimator.
The incidence-data extension needs no independence assumptions beyond i.i.d. sampling of the observed sets, so items within a set may be dependent; this broadens applicability to grouped or clustered observations.
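
The power-law check mentioned in the list above can be sketched as follows. The proof fragment quoted in reference-graph entry [29] writes the large-m prediction as S_t((1+r)^α̂ − 1), where S_t is the number of distinct species seen so far. How α̂ is defined is not shown on this page, so the sketch assumes the classical singleton ratio (species seen once divided by distinct species seen) and assumes r = m/n. All function names are hypothetical.

```python
from collections import Counter


def tail_index_estimate(sample):
    """Singleton ratio: (# species seen once) / (# distinct species).

    Assumed choice of tail-index estimator, not confirmed by the source.
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    species_counts = Counter(sample)
    singletons = sum(1 for c in species_counts.values() if c == 1)
    return singletons / len(species_counts)


def stability_profile(sample, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Tail-index estimates on growing prefixes of the sample.

    A roughly flat profile is only a heuristic sign that a power-law
    tail is plausible; it is not a formal test.
    """
    n = len(sample)
    return [tail_index_estimate(sample[: max(1, int(f * n))]) for f in fractions]


def power_law_prediction(sample, m):
    """Prediction S_t * ((1 + r)^alpha_hat - 1), assuming r = m / n."""
    r = m / len(sample)
    distinct = len(set(sample))
    return distinct * ((1 + r) ** tail_index_estimate(sample) - 1)


if __name__ == "__main__":
    observed = ["a", "b", "b", "c"]
    print(stability_profile(observed), power_law_prediction(observed, m=4))
```

If the profile drifts strongly as the prefix grows, the power-law assumption is doubtful and the large-m prediction should be treated with caution.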

Load-bearing premise

The n samples are independent and identically distributed draws from an unknown discrete distribution.

What would settle it

Generate repeated trials from a known discrete distribution such as a geometric or Zipf law, compute the Good-Toulmin predictions and intervals for small m, and check whether any competing estimator that respects symmetry achieves a strictly better rate or whether the intervals fail to cover at the claimed level.
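
A minimal sketch of such a check, assuming a truncated Zipf law and the classical Good-Toulmin form −Σ (−t)^i Φ_i with t = m/n. The paper's interval construction is not given on this page, so only the point prediction is compared with the realized count; coverage checks would need the actual interval formula. The function names and default parameters are illustrative.

```python
import random
from collections import Counter


def good_toulmin(sample, m):
    """Classical Good-Toulmin prediction (assumed textbook form)."""
    t = m / len(sample)
    phi = Counter(Counter(sample).values())  # i -> species seen i times
    return -sum((-t) ** i * phi_i for i, phi_i in phi.items())


def one_trial(rng, weights, n, m):
    """Draw n + m samples and return (prediction, realized new species)."""
    draws = rng.choices(range(len(weights)), weights=weights, k=n + m)
    observed, future = draws[:n], draws[n:]
    truth = len(set(future) - set(observed))
    return good_toulmin(observed, m), truth


def simulate(trials=200, n=500, m=250, support_size=2000, exponent=1.2, seed=0):
    """Monte Carlo estimate of the squared prediction error."""
    rng = random.Random(seed)
    # Truncated Zipf weights; rng.choices normalizes them internally.
    weights = [k ** -exponent for k in range(1, support_size + 1)]
    results = [one_trial(rng, weights, n, m) for _ in range(trials)]
    mse = sum((est - truth) ** 2 for est, truth in results) / trials
    mean_truth = sum(truth for _, truth in results) / trials
    return {"mse": mse, "mean_truth": mean_truth}


if __name__ == "__main__":
    print(simulate())
```

Swapping in a competing symmetric estimator for `good_toulmin` and repeating the run over a grid of n gives an empirical view of how the error scales, which is the comparison the paragraph above describes.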

Figures

Figures reproduced from arXiv: 2602.08769 by Edward Eriksson.

**Figure 2.** Empirical comparison of the performance of the estimators.

**Figure 3.** Checking whether the sufficient condition for uniqueness of the ...

Original abstract

Given $n$ i.i.d. samples from an unknown discrete distribution over an unknown set, the unseen species problem is to predict how many new outcomes would be observed in $m$ additional samples. For small $m$ we show that the Good-Toulmin estimator is the unique estimator which both respects the symmetries of the problem and has non-trivial rate. We resolve the open problem of constructing principled prediction intervals for it. For intermediate $m$ we propose a new estimator which has a vastly improved worst case MSE compared to competing methods and we expect that our method can be applied to other species sampling problems. For large $m$ we follow previous authors in assuming a power law tail and show that a simple estimator achieves the same rate and better empirical performance than a recent sophisticated method. Moreover, we give pre-asymptotic guarantees. We extend the rate guarantees to incidence data, without further independence assumptions, provided that the sets are of bounded size. In the process we use Stein's method to obtain concentration inequalities for some natural functionals of sequences of i.i.d. discrete-set-valued random variables which are of independent interest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Eriksson shows Good-Toulmin is unique under symmetry for small m and supplies the first principled intervals, plus a new intermediate-m estimator with better worst-case MSE.

Eriksson shows that the Good-Toulmin estimator is the unique symmetry-respecting choice with a non-trivial rate for small m, and finally supplies principled prediction intervals for it. That is the central advance. The paper also gives a new estimator for intermediate m that improves worst-case MSE over prior methods, and for large m, under a power-law tail, a simple estimator that matches known rates while showing better empirical results and carrying pre-asymptotic bounds. The Stein-method concentration inequalities for functionals of i.i.d. set-valued sequences are a clean technical addition that lets the rate guarantees extend to incidence data under only a bounded-set-size condition.

The assumptions are stated plainly up front, and the claims avoid circularity by resting on the external i.i.d. structure and standard tools rather than on self-referential fitting. The regime split is useful and the work stays within its stated limits.

The main soft spot is that the abstract only sketches the proofs, so the full derivations for uniqueness, interval construction, and error-bar handling still need checking for overlooked steps. The power-law tail for large m is a common modeling choice, but it restricts the scope if the true tail differs. These are ordinary verification points rather than load-bearing flaws.

The paper is for statisticians working on species sampling, vocabulary estimation, or related problems in ecology and machine learning who want theoretical backing and concrete estimators. Readers who care about guarantees and open-problem resolutions will get direct value. It deserves peer review because it closes a stated gap with new, checkable results.

Referee Report

0 major / 3 minor

Summary. The paper revisits the unseen species problem of predicting the number of new outcomes in m additional samples given n i.i.d. draws from an unknown discrete distribution. It partitions the problem into three regimes. For small m, it proves that the Good-Toulmin estimator is the unique estimator that respects the problem's symmetries and has a non-trivial rate, and it constructs principled prediction intervals. For intermediate m, it introduces a new estimator with substantially better worst-case MSE. For large m, under a power-law tail, a simple estimator matches the rate of recent methods while improving empirical performance and supplying pre-asymptotic guarantees. The work extends the rate results to incidence data of bounded set size via Stein-method concentration inequalities for functionals of i.i.d. set-valued random variables.

Significance. If the central claims hold, the contribution is notable: it resolves the open problem of principled prediction intervals for the Good-Toulmin estimator, establishes uniqueness under explicit symmetries, supplies a new estimator with improved worst-case guarantees, and furnishes Stein-method tools of potential independent interest for concentration on discrete-set sequences. The regime-specific analysis and the extension to incidence data with bounded set size strengthen applicability in ecology and statistics.

minor comments (3)

[Large-m section] In the large-m regime, the precise statement of the power-law tail assumption (e.g., the range of the exponent) should be stated explicitly before the rate theorem, together with the dependence of the pre-asymptotic constants on that exponent.
[Intermediate-m section] The claim that the new intermediate-m estimator 'vastly improves' worst-case MSE would benefit from a short table comparing its MSE bound to the Good-Toulmin and Chao estimators under the same worst-case distribution family.
[Incidence-data extension] Notation for the incidence-data extension (e.g., the definition of the bounded-size incidence sets) should be introduced once in a dedicated subsection rather than inline.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending minor revision. The referee's summary correctly identifies the main contributions, including the uniqueness result for the Good-Toulmin estimator, the construction of principled prediction intervals, the improved estimator for intermediate regimes, the rate-matching result under power-law tails, and the Stein-method extensions to bounded-size incidence data. We will incorporate minor revisions to address any editorial or expository points in the next version.

Circularity Check

0 steps flagged

No significant circularity; derivations rest on external assumptions and Stein's method

The paper partitions the unseen species problem into regimes and derives uniqueness of the Good-Toulmin estimator for small m from explicit symmetries plus non-trivial rate conditions, without any equation reducing the estimator to a fitted parameter or self-definition. Intermediate-m and large-m estimators are constructed directly from the stated i.i.d. discrete sampling model and (for large m) an externally assumed power-law tail; pre-asymptotic guarantees and incidence-data extensions are obtained via Stein's method concentration inequalities whose validity is independent of the target results. No load-bearing self-citation, ansatz smuggling, or renaming of known patterns appears in the derivation chain. All core claims remain falsifiable once the external i.i.d. and bounded-set-size assumptions are granted.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

Central claims rest on the i.i.d. sampling assumption and a power-law tail for the large-m regime; no explicit free parameters are introduced in the abstract, and no new entities are postulated.

axioms (3)

domain assumption: The n samples are i.i.d. from an unknown discrete distribution over an unknown set.
Stated at the opening of the abstract as the problem setup.
domain assumption: For large m, the distribution has a power-law tail.
Explicitly adopted following previous authors for the large-m regime.
domain assumption: Incidence sets are of bounded size.
Required for the extension of rate guarantees to incidence data.

pith-pipeline@v0.9.0 · 5480 in / 1471 out tokens · 32012 ms · 2026-05-16T05:27:49.750717+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

For small m we show that the Good-Toulmin estimator is the unique estimator which both respects the symmetries of the problem and has non-trivial rate.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We extend the rate guarantees to incidence data... using Stein's method to obtain concentration inequalities

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

[1]

Alon Orlitsky, Ananda Theertha Suresh and Yihong Wu. ‘Optimal prediction of the number of unseen species’. In: Proceedings of the National Academy of Sciences 113 (Nov. 2016), pp. 13283–13288. doi: 10.1073/pnas.1607774113
[2]

Marcel Böhme. ‘STADS’. In: ACM Transactions on Software Engineering and Methodology 27 (July 2018), pp. 1–52. doi: 10.1145/3210309
[3]

Bradley Adams et al. ‘Victim identification from the September 11, 2001 attack on the World Trade Center: Past trends and future projections’. In: Forensic Science International 340 (Nov. 2022), p. 111463. doi: 10.1016/j.forsciint.2022.111463. url: https://pubmed.ncbi.nlm.nih.gov/36150280/ (visited on 09/02/2026)
[4]

Bradley Efron and Ronald Thisted. ‘Estimating the Number of Unseen Species: How Many Words Did Shakespeare Know?’ In: Biometrika 63 (Dec. 1976), p. 435. doi: 10.2307/2335721
[5]

Timothy Daley and Andrew D. Smith. ‘Modeling genome coverage in single-cell sequencing’. In: Bioinformatics 30 (Aug. 2014), pp. 3159–3165. doi: 10.1093/bioinformatics/btu540
[6]

Anne Chao et al. ‘Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies’. In: Ecological Monographs 84 (Feb. 2014), pp. 45–67. doi: 10.1890/13-0133.1
[7]

Fadhel Ayed et al. ‘A Good-Turing estimator for feature allocation models’. In: Electronic Journal of Statistics 13 (2019). doi: 10.1214/19-ejs1614
[8]

Harry Crane and Walter Dempsey. ‘Edge Exchangeable Models for Interaction Networks’. In: J Am Stat Assoc 113 (June 2018), pp. 1311–1326. doi: 10.1080/01621459.2017.1341413
[9]

Anne Chao et al. ‘Seen once or more than once: applying Good–Turing theory to estimate species richness using only unique observations and a species list’. In: Methods in Ecology and Evolution 8 (Apr. 2017). Ed. by David Murrell, pp. 1221–1232. doi: 10.1111/2041-210x.12768
[10]

Svante Janson. ‘On Edge Exchangeable Random Graphs’. In: Journal of Statistical Physics 173 (June 2017), pp. 448–484. doi: 10.1007/s10955-017-1832-9
[11]

Edward Eriksson. Edge Exchangeable Graphs: Connectedness, Gaussianity and Completeness. arXiv.org, 2025. url: https://arxiv.org/abs/2501.09511 (visited on 05/05/2025)
[12]

I. J. Good and G. H. Toulmin. ‘The Number of New Species, and the Increase in Population Coverage, when a Sample is Increased’. In: Biometrika 43 (1956), pp. 45–63. doi: 10.1093/biomet/43.1-2.45
[13]

Claudia Contardi, Emanuele Dolera and Stefano Favaro. Gaussian credible intervals in Bayesian nonparametric estimation of the unseen. arXiv.org, 2025. url: https://arxiv.org/abs/2501.16008 (visited on 30/09/2025)
[14]

Anna Ben-Hamou, Stéphane Boucheron and Mesrob I. Ohannessian. ‘Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications’. In: Bernoulli 23 (Feb. 2017). doi: 10.3150/15-bej743. url: https://projecteuclid.org/journals/bernoulli/volume-23/issue-1/Concentration-inequalities-...
[15]

Stefano Favaro and Zacharie Naulet. ‘Near-optimal estimation of the unseen under regularly varying tail populations’. In: Bernoulli 29 (Nov. 2023). doi: 10.3150/23-bej1589
[16]

Jay Bartroff, Larry Goldstein and Ümit Işlak. ‘Bounded size biased couplings, log concave distributions and concentration of measure for occupancy models’. In: Bernoulli 24 (Nov. 2018). doi: 10.3150/17-bej961
[17]

Yihong Wu and Pengkun Yang. ‘Chebyshev polynomials, moment matching, and optimal estimation of the unseen’. In: Annals of statistics 47 (Apr. 2019). doi: 10.1214/17-aos1665
[18]

Nicholas Cook, Larry Goldstein and Tobias Johnson. ‘Size biased couplings and the spectral gap for random regular graphs’. In: The Annals of Probability 46 (Jan. 2018). doi: 10.1214/17-aop1180. url: https://projecteuclid.org/journals/annals-of-probability/volume-46/issue-1/Size-biased-couplings-and-the-spectral-gap-...
[19]

Eugene Yu et al. ‘Spatial occurrence records and distributions of tropical Asian butterflies’. In: Scientific Data 12 (June 2025). doi: 10.1038/s41597-025-05333-w. url: https://www.nature.com/articles/s41597-025-05333-w
[20]

Pietro Panzarasa, Tore Opsahl and Kathleen M. Carley. ‘Patterns and dynamics of users’ behavior and interaction: Network analysis of an online community’. In: Journal of the American Society for Information Science and Technology 60 (May 2009), pp. 911–932. doi: 10.1002/asi.21015
[21]

Josh Birnholz. GitHub - joshbirnholz/Magic-Booster-Pack-Generator. GitHub, Apr. 2025. url: https://github.com/joshbirnholz/Magic-Booster-Pack-Generator (visited on 31/01/2026)
[22]

S. Loukas and C. D. Kemp. ‘The Index of Dispersion Test for the Bivariate Poisson Distribution’. In: Biometrics 42 (1986), pp. 941–948. doi: 10.2307/2530708. url: https://www.jstor.org/stable/2530708
[23]

Haïm Brézis. Functional analysis, Sobolev spaces and partial differential equations. Springer, Cop, 2011. url: https://www.math.utoronto.ca/ almut/Brezis.pdf (visited on 29/01/2026)
[24]

Michael Struwe. Variational Methods. Springer Berlin Heidelberg, 2008. doi: 10.1007/978-3-540-74013-1. url: https://link.springer.com/book/10.1007/978-3-540-74013-1 (visited on 29/01/2026)
[25]

Mikhail Chebunin and Artyom Kovalevskii. ‘Functional central limit theorems for certain statistics in an infinite urn scheme’. In: Statistics & Probability Letters 119 (Dec. 2016), pp. 344–348. doi: 10.1016/j.spl.2016.08.019
[26]

Multiply the first lemma on both sides by 16 17 and the latter by 1 17 before adding. 43 Lemma 55. sup µ E[St,T − ˆSt,T ]2 ≥ 1 16 Yb. Moreover, suppose thatb(p)≤1 for allp. Then sup µ E[St,T − ˆSt,T ]2 ≥m 2 pYb −2m p p Yb + 1. Proof.We will lower bound the supremum by finding aµwith large squared bias. Because the supremum overpis over a compact set, the ...

[27]

which is convex (but not strictly convex) inH. For strictness, letH (1), H(2) be as hypothesized. Then, for someq ∗ ̸= 0 sup q∈[0,1] e−qt q (1 +e rqt +g (θH (1)+(1−θ)H (2))2(qt)), = e−q∗t q∗ (1 +e rq∗t +g (θH (1)+(1−θ)H (2))2(q∗t)). Applying strict convexity at thisq ∗ ̸= 0 now propagates to gives the result as sums of convex and strictly convex functions...

[28]

Takex ∗ = 1 2. lim r→∞ 1 r (r1 2 − r 1 2 e 1 2 (r+r 2)−1 +e − 1 2 −r 2 1 2 r e 1 2 − 1 2 −1−(1−e −r 1 2 )), = 1 2 − r 1 2(e 1 2 −1) r e 1 2 − 3 2 , ≈0.28>0. 50 Thus for large enoughr 0 the choicex ∗ = 1 2 will do. Next let ˜p= x∗ t . 1 t sup p∈[0,1] e−pt p |1−e −rpt −g H(pt)| ≥ e−˜pt t˜p|1−e −r˜pt−g H(˜pt)|, = e−x∗ x∗ |1−e −rx∗ −g H(x∗)|, ≥ e−x∗ x∗ (gH(x∗...

[29]

In the following, we will show that also the remaining two probabilities can be bounded from above by $\epsilon/3$. We continue with $P(E_n^c \cap A_n \cap B_n)$. Similarly as in [15], we rewrite

$$\hat S_{t,T} - S_{t,T} = S_t\big((1+r)^{\hat\alpha} - 1\big) - (S_T - S_t) = S_t (1+r)^{\alpha}\big((1+r)^{\hat\alpha - \alpha} - 1\big) + \big(S_t (1+r)^{\alpha} - S_T\big) \le S_t (1+r)^{\alpha} \log(1+r) \sup_{a \in [\hat\alpha, \alpha]} e^{\log(1+r)(a - \alpha)}\, |\hat\alpha - \alpha| + \big|S_t (1+r)^{\alpha} - S_T\big|. \tag{22}$$

On th...
