Estimating the size of a set using cascading exclusion

Persi Diaconis; Sourav Chatterjee; Susan Holmes

arxiv: 2508.05901 · v4 · submitted 2025-08-07 · 🧮 math.ST · cs.IT· math.IT· math.PR· stat.ML· stat.TH

Estimating the size of a set using cascading exclusion

Sourav Chatterjee , Persi Diaconis , Susan Holmes This is my paper

Pith reviewed 2026-05-18 23:57 UTC · model grok-4.3

classification 🧮 math.ST cs.ITmath.ITmath.PRstat.MLstat.TH

keywords set size estimationcascading exclusionnon-asymptotic boundsunseen species problemconvex volume estimationbirthday problemfinite sample theorypopulation testing

0 comments

The pith

Cascading exclusion interpolates between birthday repeats and maximum estimators to bound the size of a finite set from samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a general non-asymptotic theory for estimating the cardinality of an unknown finite set S from an i.i.d. uniform sample of size n. It introduces refinements that bridge the classic birthday-problem regime, which needs samples of order sqrt(|S|), and the efficient maximum-based estimator available when S is ordered like the integers. The resulting bounds apply directly to estimating the volume of a compact convex set, to the unseen species problem, and to testing whether a new observation is typical from a large prespecified population. The same framework supplies non-parametric finite-n error bounds for regression-style predictors.

Core claim

The authors show that a cascading-exclusion construction produces a general theorem giving non-parametric finite n error bounds for estimating |S|; the bounds interpolate between the birthday-problem setting and the ordered-set maximum estimator and extend to volume estimation, unseen-species counts, and a family of population-testing problems.

What carries the argument

Cascading exclusion, a sequential process that excludes observed elements to tighten bounds on the size of the remaining unobserved portion of the set.

If this is right

The volume of a compact convex set can be estimated with explicit non-asymptotic error bounds.
The unseen species problem receives finite-n error bounds.
Testing whether a new observation is a typical pick from a large prespecified population is supported by the same bounds.
Regression-style predictors obtain non-parametric finite-n error bounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The non-asymptotic bounds could be used to choose minimal sample sizes in advance for reliable estimation.
The construction may extend to related support-size problems in high-dimensional statistics.

Load-bearing premise

The observations form an i.i.d. uniform sample from the unknown finite set S.

What would settle it

Repeated draws from a set of known size where the cascading-exclusion error bounds are violated more frequently than the theorem guarantees.

Figures

Figures reproduced from arXiv: 2508.05901 by Persi Diaconis, Sourav Chatterjee, Susan Holmes.

**Figure 2.** Figure 2: Verification of the MSE bound from Theorem [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

**Figure 3.** Figure 3: Verification of the second bound from Theorem [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Illustrations of the volume estimator vol( \K) from corollary 2.5. The ratio E[vol( \K)/vol(K)] approaches 1 (unbiased estimation) as sample size increases for unit cubes, unit balls, triangles, and tetrahedra in two and three dimensions. This convergence supports the validity of the volume estimation approach underlying the corollary’s error bound. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Three examples of partially ordered sets. A line connecting two vertices indicates [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Illustration of subforest sampling. (a) A finite convex subset [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison of Within-Sample and Sample-to-Population Distance Distributions. (a) [PITH_FULL_IMAGE:figures/full_fig_p025_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison of Anderson–Darling statistics. (a) The AD statistic for the leave [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗

**Figure 9.** Figure 9: Distribution of nearest neighbor distances in the SILVA data. Top: Within-sample [PITH_FULL_IMAGE:figures/full_fig_p027_9.png] view at source ↗

**Figure 10.** Figure 10: MSE of exclusion coverage probabilities between comparing the LOO estimate of [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗

read the original abstract

Let $S$ be a finite set, and $X_1,\ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$, without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of the order $|S|^\frac{1}{2}$. On the other hand, if $S=\{1,2,\ldots,|S|\}$, the maximum of the sample blown up by $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed. This includes estimating the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question `Is this new observation a typical pick from a large prespecified population?' We also treat regression style predictors. A general theorem gives non-parametric finite $n$ error bounds in all cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Cascading exclusion gives a workable interpolation between birthday and max estimators for set size, backed by non-asymptotic bounds that extend to volume and species problems.

read the letter

The main point is that Chatterjee, Diaconis, and Holmes introduce cascading exclusion to estimate the size of a finite set from i.i.d. uniform samples. It bridges the gap between the birthday collision method, which needs samples on the order of the square root of the size, and the scaled maximum estimator that works for ordered sets with smaller samples. They supply a general non-asymptotic theory with explicit finite-n error bounds that covers the basic cardinality problem and carries over to volume estimation for compact convex sets, the unseen species problem, and some testing and regression-style predictors. The derivations stay grounded in the plain i.i.d. uniform model without added parameters or circular steps, which keeps the claims straightforward. The math appears clean on the core results, and the stress-test note confirms no obvious internal gaps in the reduction to the listed applications. That counts as real progress over just citing the two classic extremes. One soft spot is the uniform sampling premise itself. In practice, for geometry or ecology data, observations often deviate from perfect uniformity, so the bounds may need adjustment or robustness checks before they translate directly. The paper states the assumption explicitly, but the applications would benefit from more discussion of how sensitive the error bounds are when that premise is only approximate. Another minor point is that while the abstract promises general theorems, the tightness of the constants for moderate sample sizes would be worth verifying in the full proofs. This paper is for statisticians and applied probabilists who need finite-sample tools for cardinality or related inference tasks rather than asymptotic approximations. Readers working on non-parametric estimation or diversity measures would get concrete value from the interpolation and the explicit bounds. I would bring it to a reading group for the technical details on the bounds. It deserves peer review; the core idea is grounded enough and the authors' track record suggests the details are worth referee time even if some applications need tightening.

Referee Report

2 major / 3 minor

Summary. The paper develops a cascading exclusion estimator for the cardinality |S| of an unknown finite set S, based on an i.i.d. uniform sample X_1,...,X_n from S. It interpolates between the classical birthday-problem regime (requiring n ~ |S|^{1/2}) and the max-based estimator available when S is ordered, and supplies a general non-asymptotic theory with explicit finite-n error bounds. The framework is applied to volume estimation of compact convex sets, the unseen-species problem, typicality testing, and regression-style predictors.

Significance. If the central non-asymptotic bounds and their derivations hold, the work supplies a unified, parameter-light approach to cardinality estimation that bridges two classical extremes and yields concrete guarantees for several standard statistical tasks. The explicit treatment of multiple applications and the absence of fitted parameters are strengths.

major comments (2)

[§4] §4 (general theorem): the non-asymptotic error bound is stated to be 'parameter-free' once the cascading-exclusion probabilities are fixed, yet the proof sketch appears to retain an implicit dependence on the unknown partition structure of S that is not reduced to observable quantities; this needs to be clarified or removed for the claim to be load-bearing.
[Application to convex-volume estimation] Application to convex-volume estimation (around Eq. (12)): the reduction from the general theorem to the volume estimator invokes an additional discretization step whose error is bounded only asymptotically; a fully non-asymptotic version of this step is required to support the paper's uniform finite-n guarantee.

minor comments (3)

[Abstract] The abstract lists 'a host of testing problems' without naming them; a short enumerated list would improve readability.
[§2] Notation for the cascading-exclusion indicator sequence is introduced without a dedicated display equation; adding one would help readers track the subsequent probability calculations.
[Introduction] Several references to classical birthday-problem bounds are given only by name; adding the precise citations (e.g., to the original Erdős–Rényi or modern non-asymptotic versions) would strengthen the literature placement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [§4] §4 (general theorem): the non-asymptotic error bound is stated to be 'parameter-free' once the cascading-exclusion probabilities are fixed, yet the proof sketch appears to retain an implicit dependence on the unknown partition structure of S that is not reduced to observable quantities; this needs to be clarified or removed for the claim to be load-bearing.

Authors: We appreciate the referee's careful scrutiny of the proof in §4. The error bound is expressed exclusively in terms of the cascading-exclusion probabilities, which are fixed once computed from the sample and do not require separate knowledge of the partition structure of S. The partition influences the probabilities but enters the final bound only through them; no additional implicit dependence remains. To remove any ambiguity, we will revise the theorem statement and expand the proof sketch to explicitly trace the reduction to these observable quantities, confirming that the bound is parameter-free conditional on the probabilities. revision: yes
Referee: [Application to convex-volume estimation] Application to convex-volume estimation (around Eq. (12)): the reduction from the general theorem to the volume estimator invokes an additional discretization step whose error is bounded only asymptotically; a fully non-asymptotic version of this step is required to support the paper's uniform finite-n guarantee.

Authors: We agree that the current treatment of the discretization step in the convex-volume application relies on asymptotic error control. To maintain the uniform finite-n guarantees claimed throughout the paper, we will derive and insert an explicit non-asymptotic bound for the discretization error. This revision will ensure the volume estimator inherits fully non-asymptotic guarantees from the general theorem. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained from explicit i.i.d. sampling model

full rationale

The paper states its core modeling premise explicitly (i.i.d. uniform draws from unknown finite set S) and derives the cascading-exclusion probabilities, non-asymptotic error bounds, and applications (volume estimation, unseen species, regression predictors) directly from that premise via standard probabilistic arguments. No step reduces a claimed prediction or first-principles result to a fitted parameter, self-referential definition, or load-bearing self-citation chain; the central theorem and its proof remain independent of the target quantities once the sampling model is granted. This is the normal case of a self-contained statistical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard domain assumption of i.i.d. uniform sampling from a finite set; no free parameters, invented entities, or additional axioms are indicated in the abstract.

axioms (1)

domain assumption X_1,...,X_n are i.i.d. uniform samples from finite set S
This is the explicit modeling premise stated at the opening of the abstract and required for all subsequent estimators and bounds.

pith-pipeline@v0.9.0 · 5714 in / 1213 out tokens · 41987 ms · 2026-05-18T23:57:56.988827+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 1.1 ... E[(μ(A) − (1/n) Σ 1{Xi ∈ A'})²] ≤ 4δ' + 4(n−1)δ''/n + 2θ/n
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

cascading exclusion ... leave-one-out sums ... stability between A, A', A''

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 1 internal anchor

[1]

A. N. Angelopoulos, R. F. Barber, and S. Bates. Theoretical Foundations of Conformal Prediction. arXiv preprint arXiv:2411.11824, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Austern and W

M. Austern and W. Zhou. Asymptotics of cross-validation. arXiv preprint arXiv:2001.11111, 2020

work page arXiv 2001
[3]

E. Badr, M. EL-Hakeem, E. E. El-Sharawy, and T. E. Ahmed. An efficient algorithm for decomposition of partially ordered sets.Journal of Mathematics, 2023(1):9920700, 2023

work page 2023
[4]

Baldin and M

N. Baldin and M. Reiß. Unbiased estimation of the volume of a convex body.Stochastic Processes and their Applications, 126(12):3716–3732, 2016

work page 2016
[5]

Bunge and M

J. Bunge and M. Fitzpatrick. Estimating the number of species: A review.Journal of the American Statistical Association, 88(421):364–373, 1993

work page 1993
[6]

B. J. Callahan, P. J. McMurdie, M. J. Rosen, A. W. Han, A. J. A. Johnson, and S. P. Holmes. Dada2: High-resolution sample inference from illumina amplicon data.Nature methods, 13(7):581–583, 2016. 45

work page 2016
[7]

A. Chao. Nonparametric estimation of the number of classes in a population.Scandinavian Journal of statistics, pages 265–270, 1984

work page 1984
[8]

Chao and S.-M

A. Chao and S.-M. Lee. Estimating the number of classes via sample coverage.Journal of the American statistical Association, 87(417):210–217, 1992

work page 1992
[9]

Y. Chen, P. Diaconis, S. P. Holmes, and J. S. Liu. Sequential Monte Carlo methods for statistical analysis of tables.Journal of the American Statistical Association, 100(469): 109–120, 2005

work page 2005
[10]

A. S. Corbet and R. A. Fisher. The butterfly species of a sample from malaya.Journal of Animal Ecology, 12(1):27–37, 1943

work page 1943
[11]

Darroch and D

J. Darroch and D. Ratcliff. A note on capture-recapture estimation.Biometrics, pages 149–153, 1980

work page 1980
[12]

Devroye.Nonparametric Discrimination and Density Estimation

L. Devroye.Nonparametric Discrimination and Density Estimation. PhD thesis, Univer- sity of Texas at Austin, 1976. URLhttps://apps.dtic.mil/sti/pdfs/ADA032738.pdf

work page 1976
[13]

Devroye, L

L. Devroye, L. Györfi, and G. Lugosi.A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013

work page 2013
[14]

Diaconis and H

P. Diaconis and H. Anderson. Hit and run as a unifying device.J. Soc. Francaise Statist, 148:5–28, 2007

work page 2007
[15]

Diaconis and B

P. Diaconis and B. Efron. Testing for independence in a two-way table: new interpretations of the chi-square statistic.The Annals of Statistics, pages 845–874, 1985

work page 1985
[16]

Diaconis and S

P. Diaconis and S. Holmes. Three examples of Monte -Carlo Markov chains: at the interface between statistical computing, computer science, and statistical mechanics. In Discrete probability and algorithms, pages 43–56. Springer, 1995

work page 1995
[17]

Diaconis and M

P. Diaconis and M. Howes. Random sampling of partitions and contingency tables: Two practical examples of the Burnside process.arXiv preprint arXiv:2503.02818, 2025

work page arXiv 2025
[18]

Diaconis and C

P. Diaconis and C. Stein. Decision theorey, 1981. unpublished chapter

work page 1981
[19]

Diaconis and B

P. Diaconis and B. Sturmfels. Algebraic algorithms for sampling from conditional distributions. The Annals of statistics, 26(1):363–397, 1998

work page 1998
[20]

Diaconis and C

P. Diaconis and C. Zhong. Hahn polynomials and the Burnside process.The Ramanujan Journal, pages 1–29, 2020. doi: 10.1007/s11139-021-00482-z. URLhttps://doi.org/ 10.1007/s11139-021-00482-z

work page doi:10.1007/s11139-021-00482-z 2020
[21]

Diaconis and C

P. Diaconis and C. Zhong. Counting the number of group orbits by marrying the burnside process with importance sampling.arXiv preprint arXiv:2501.11731, 2025. 46

work page arXiv 2025
[22]

Diaconis, S

P. Diaconis, S. Holmes, and M. Shahshahani. Sampling from a manifold. InAdvances in modern statistical theory and applications: a Festschrift in honor of Morris L. Eaton, volume 10, pages 102–126. Institute of Mathematical Statistics, 2013

work page 2013
[23]

Dyer and A

M. Dyer and A. Frieze. Computing the volume of convex bodies: A case where randomness provably helps. In Probabilistic Combinatorics and Its Applications, volume 44 of Proceedings of Symposia in Applied Mathematics, pages 123–170. American Mathematical Society, 1991

work page 1991
[24]

M. E. Dyer, A. M. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies.Journal of the ACM, 35(4):809–832, 1988

work page 1988
[25]

B. Efron. The convex hull of a random set of points.Biometrika, 52(3-4):331–343, 1965

work page 1965
[26]

Efron and R

B. Efron and R. Thisted. Estimating the Number of Unseen Species: How Many Words Did Shakespeare Know?Biometrika, 63(3):435–447, 1976

work page 1976
[27]

Efron and R

B. Efron and R. Tibshirani. Statistical data analysis in the computer age.Science, 253 (5018):390–395, 1991

work page 1991
[28]

Favaro, B

S. Favaro, B. Nipoti, and Y. W. Teh. Rediscovery of good-turing estimators via bayesian nonparametrics. Biometrics, 72(1):136–145, 2016

work page 2016
[29]

I. J. Good. Turing’s anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma.Journal of Statistical Computation and Simulation, 66(2):101–111, 2000

work page 2000
[30]

N. J. Gotelli and R. K. Colwell. Estimating species richness. In A. E. Magurran and B. J. McGill, editors,Biological Diversity: Frontiers in Measurement and Assessment, pages 39–54. Oxford University Press, Oxford, UK, 2010

work page 2010
[31]

Harper and S

L. Harper and S. Bezrukov. Monte Carlo estimation of the number of ideals in a poset. Technical report, Dept of Mathematics, UC Riverside, CA, 2008

work page 2008
[32]

S. H. Holmes and W. Huber.Modern statistics for modern biology. Cambridge university press, 2018

work page 2018
[33]

Homrighausen and D

D. Homrighausen and D. J. McDonald. Leave-one-out cross-validation is risk consistent for lasso. Machine learning, 97(1):65–78, 2014

work page 2014
[34]

S. H. Hurlbert. The nonconcept of species diversity: a critique and alternative parameters. Ecology, 52(4):577–586, 1971

work page 1971
[35]

Hwang, R

W.-H. Hwang, R. Huggins, and L.-F. Chen. A note on the inverse birthday problem with applications. The American Statistician, 71(3):191–201, 2017

work page 2017
[36]

Jaccard.Lois de Distribution Florale dans la Zone Alpine, volume 38

P. Jaccard.Lois de Distribution Florale dans la Zone Alpine, volume 38. Bulletin de la Société Vaudoise des Sciences Naturelles, 1902. 47

work page 1902
[37]

Jeffreys.The theory of probability

H. Jeffreys.The theory of probability. Oxford University Press, 1939

work page 1939
[38]

M. Jerrum. Uniform sampling modulo a group of symmetries using Markov chain simula- tion. In J. Friedman, editor,Expanding Graphs, volume 10 ofDIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 37–47. American Mathematical Society, 1993. URLhttps://www.lfcs.inf.ed.ac.uk/reports/93/ECS-LFCS-93-272/ ECS-LFCS-93-272.ps

work page 1993
[39]

Kannan, L

R. Kannan, L. Lovász, and M. Simonovits. Random walks and ano∗(n5) volume algorithm for convex bodies.Random Structures and Algorithms, 11(1):1–50, 1997

work page 1997
[40]

Korte, L

B. Korte, L. Lovász, and R. Schrader.Greedoids, volume 4. Springer Science & Business Media, 2012

work page 2012
[41]

Lijoi, R

A. Lijoi, R. H. Mena, and I. Prünster. Bayesian nonparametric estimation of the probability of discovering new species.Biometrika, 94(4):769–786, 2007

work page 2007
[42]

S.-H. Lo. From the species problem to a general coverage problem via a new interpretation. The Annals of Statistics, 20(2):1094–1109, 1992

work page 1992
[43]

A. Maurer. Concentration of the missing mass in metric spaces. arXiv preprint arXiv:2206.02012, 2022

work page arXiv 2022
[44]

R. D. Millán, N. Sukhorukova, and J. Ugon. Application and issues in abstract convexity. arXiv preprint arXiv:2202.09959, 2022

work page arXiv 2022
[45]

Quast, E

C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies, and F. O. Glöckner. The silva ribosomal rna gene database project: improved data processing and web-based tools. Nucleic acids research, 41(D1):D590–D596, 2012

work page 2012
[46]

H. E. Robbins. Estimating the total probability of the unobserved outcomes of an experiment. Ann. Math. Statist., 39(6):256–257, 1968

work page 1968
[47]

W. H. Rogers and T. J. Wagner. A finite sample distribution-free performance bound for local discrimination rules.Annals of Statistics, pages 506–514, 1978

work page 1978
[48]

H. L. Sanders. Marine benthic diversity: a comparative study.The American Naturalist, 102(925):243–282, 1968

work page 1968
[49]

J. Shao. Linear model selection by cross-validation.Journal of the American statistical Association, 88(422):486–494, 1993

work page 1993
[50]

A. F. Siegel and R. Z. German. Rarefaction and taxonomic diversity.Biometrics, pages 235–241, 1982

work page 1982
[51]

Simberloff

D. Simberloff. Properties of the rarefaction diversity measurement. The American Naturalist, 106(949):414–418, 1972. 48

work page 1972
[52]

Sinclair and M

A. Sinclair and M. Jerrum. Approximate counting, uniform generation and rapidly mixing Markov chains.Information and Computation, 82(1):93–133, 1989

work page 1989
[53]

J. E. Spencer and A. Largey. Geary on inference in multiple regregression and on the taxi problem. Economic and Social Review, 24(3):275–294, 1993

work page 1993
[54]

R. P. Stanley. Enumerative combinatorics volume 1 second edition.Cambridge studies in advanced mathematics, 2011

work page 2011
[55]

M. Stone. Cross-validation: A review.Statistics: A Journal of Theoretical and Applied Statistics, 9(1):127–139, 1978

work page 1978
[56]

W. T. Trotter.Combinatorics and Partially Ordered Sets: Dimension Theory. Johns Hopkins University Press, 1992

work page 1992
[57]

L. G. Valiant. The complexity of computing the permanent.Theoretical Computer Science, 8(2):189–201, 1979

work page 1979
[58]

M. L. van De Vel.Theory of convex structures, volume 50. Elsevier, 1993

work page 1993
[59]

Convex volume approximation.https://en.wikipedia.org/ wiki/Convex_volume_approximation, 2025

Wikipedia contributors. Convex volume approximation.https://en.wikipedia.org/ wiki/Convex_volume_approximation, 2025. Accessed: 26 June 2025

work page 2025
[60]

German tank problem

Wikipedia contributors. German tank problem. https://en.wikipedia.org/wiki/ German_tank_problem, 2025. Accessed: 19 June 2025

work page 2025
[61]

Y. Yang. Consistency of cross validation for comparing regression procedures.The Annals of Statistics, pages 2450–2473, 2007. 49

work page 2007

[1] [1]

A. N. Angelopoulos, R. F. Barber, and S. Bates. Theoretical Foundations of Conformal Prediction. arXiv preprint arXiv:2411.11824, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Austern and W

M. Austern and W. Zhou. Asymptotics of cross-validation. arXiv preprint arXiv:2001.11111, 2020

work page arXiv 2001

[3] [3]

E. Badr, M. EL-Hakeem, E. E. El-Sharawy, and T. E. Ahmed. An efficient algorithm for decomposition of partially ordered sets.Journal of Mathematics, 2023(1):9920700, 2023

work page 2023

[4] [4]

Baldin and M

N. Baldin and M. Reiß. Unbiased estimation of the volume of a convex body.Stochastic Processes and their Applications, 126(12):3716–3732, 2016

work page 2016

[5] [5]

Bunge and M

J. Bunge and M. Fitzpatrick. Estimating the number of species: A review.Journal of the American Statistical Association, 88(421):364–373, 1993

work page 1993

[6] [6]

B. J. Callahan, P. J. McMurdie, M. J. Rosen, A. W. Han, A. J. A. Johnson, and S. P. Holmes. Dada2: High-resolution sample inference from illumina amplicon data.Nature methods, 13(7):581–583, 2016. 45

work page 2016

[7] [7]

A. Chao. Nonparametric estimation of the number of classes in a population.Scandinavian Journal of statistics, pages 265–270, 1984

work page 1984

[8] [8]

Chao and S.-M

A. Chao and S.-M. Lee. Estimating the number of classes via sample coverage.Journal of the American statistical Association, 87(417):210–217, 1992

work page 1992

[9] [9]

Y. Chen, P. Diaconis, S. P. Holmes, and J. S. Liu. Sequential Monte Carlo methods for statistical analysis of tables.Journal of the American Statistical Association, 100(469): 109–120, 2005

work page 2005

[10] [10]

A. S. Corbet and R. A. Fisher. The butterfly species of a sample from malaya.Journal of Animal Ecology, 12(1):27–37, 1943

work page 1943

[11] [11]

Darroch and D

J. Darroch and D. Ratcliff. A note on capture-recapture estimation.Biometrics, pages 149–153, 1980

work page 1980

[12] [12]

Devroye.Nonparametric Discrimination and Density Estimation

L. Devroye.Nonparametric Discrimination and Density Estimation. PhD thesis, Univer- sity of Texas at Austin, 1976. URLhttps://apps.dtic.mil/sti/pdfs/ADA032738.pdf

work page 1976

[13] [13]

Devroye, L

L. Devroye, L. Györfi, and G. Lugosi.A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013

work page 2013

[14] [14]

Diaconis and H

P. Diaconis and H. Anderson. Hit and run as a unifying device.J. Soc. Francaise Statist, 148:5–28, 2007

work page 2007

[15] [15]

Diaconis and B

P. Diaconis and B. Efron. Testing for independence in a two-way table: new interpretations of the chi-square statistic.The Annals of Statistics, pages 845–874, 1985

work page 1985

[16] [16]

Diaconis and S

P. Diaconis and S. Holmes. Three examples of Monte -Carlo Markov chains: at the interface between statistical computing, computer science, and statistical mechanics. In Discrete probability and algorithms, pages 43–56. Springer, 1995

work page 1995

[17] [17]

Diaconis and M

P. Diaconis and M. Howes. Random sampling of partitions and contingency tables: Two practical examples of the Burnside process.arXiv preprint arXiv:2503.02818, 2025

work page arXiv 2025

[18] [18]

Diaconis and C

P. Diaconis and C. Stein. Decision theorey, 1981. unpublished chapter

work page 1981

[19] [19]

Diaconis and B

P. Diaconis and B. Sturmfels. Algebraic algorithms for sampling from conditional distributions. The Annals of statistics, 26(1):363–397, 1998

work page 1998

[20] [20]

Diaconis and C

P. Diaconis and C. Zhong. Hahn polynomials and the Burnside process.The Ramanujan Journal, pages 1–29, 2020. doi: 10.1007/s11139-021-00482-z. URLhttps://doi.org/ 10.1007/s11139-021-00482-z

work page doi:10.1007/s11139-021-00482-z 2020

[21] [21]

Diaconis and C

P. Diaconis and C. Zhong. Counting the number of group orbits by marrying the burnside process with importance sampling.arXiv preprint arXiv:2501.11731, 2025. 46

work page arXiv 2025

[22] [22]

Diaconis, S

P. Diaconis, S. Holmes, and M. Shahshahani. Sampling from a manifold. InAdvances in modern statistical theory and applications: a Festschrift in honor of Morris L. Eaton, volume 10, pages 102–126. Institute of Mathematical Statistics, 2013

work page 2013

[23] [23]

Dyer and A

M. Dyer and A. Frieze. Computing the volume of convex bodies: A case where randomness provably helps. In Probabilistic Combinatorics and Its Applications, volume 44 of Proceedings of Symposia in Applied Mathematics, pages 123–170. American Mathematical Society, 1991

work page 1991

[24] [24]

M. E. Dyer, A. M. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies.Journal of the ACM, 35(4):809–832, 1988

work page 1988

[25] [25]

B. Efron. The convex hull of a random set of points.Biometrika, 52(3-4):331–343, 1965

work page 1965

[26] [26]

Efron and R

B. Efron and R. Thisted. Estimating the Number of Unseen Species: How Many Words Did Shakespeare Know?Biometrika, 63(3):435–447, 1976

work page 1976

[27] [27]

Efron and R

B. Efron and R. Tibshirani. Statistical data analysis in the computer age.Science, 253 (5018):390–395, 1991

work page 1991

[28] [28]

Favaro, B

S. Favaro, B. Nipoti, and Y. W. Teh. Rediscovery of good-turing estimators via bayesian nonparametrics. Biometrics, 72(1):136–145, 2016

work page 2016

[29] [29]

I. J. Good. Turing’s anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma.Journal of Statistical Computation and Simulation, 66(2):101–111, 2000

work page 2000

[30] [30]

N. J. Gotelli and R. K. Colwell. Estimating species richness. In A. E. Magurran and B. J. McGill, editors,Biological Diversity: Frontiers in Measurement and Assessment, pages 39–54. Oxford University Press, Oxford, UK, 2010

work page 2010

[31] [31]

Harper and S

L. Harper and S. Bezrukov. Monte Carlo estimation of the number of ideals in a poset. Technical report, Dept of Mathematics, UC Riverside, CA, 2008

work page 2008

[32] [32]

S. H. Holmes and W. Huber.Modern statistics for modern biology. Cambridge university press, 2018

work page 2018

[33] [33]

Homrighausen and D

D. Homrighausen and D. J. McDonald. Leave-one-out cross-validation is risk consistent for lasso. Machine learning, 97(1):65–78, 2014

work page 2014

[34] [34]

S. H. Hurlbert. The nonconcept of species diversity: a critique and alternative parameters. Ecology, 52(4):577–586, 1971

work page 1971

[35] [35]

Hwang, R

W.-H. Hwang, R. Huggins, and L.-F. Chen. A note on the inverse birthday problem with applications. The American Statistician, 71(3):191–201, 2017

work page 2017

[36] [36]

Jaccard.Lois de Distribution Florale dans la Zone Alpine, volume 38

P. Jaccard.Lois de Distribution Florale dans la Zone Alpine, volume 38. Bulletin de la Société Vaudoise des Sciences Naturelles, 1902. 47

work page 1902

[37] [37]

Jeffreys.The theory of probability

H. Jeffreys.The theory of probability. Oxford University Press, 1939

work page 1939

[38] [38]

M. Jerrum. Uniform sampling modulo a group of symmetries using Markov chain simula- tion. In J. Friedman, editor,Expanding Graphs, volume 10 ofDIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 37–47. American Mathematical Society, 1993. URLhttps://www.lfcs.inf.ed.ac.uk/reports/93/ECS-LFCS-93-272/ ECS-LFCS-93-272.ps

work page 1993

[39] [39]

Kannan, L

R. Kannan, L. Lovász, and M. Simonovits. Random walks and ano∗(n5) volume algorithm for convex bodies.Random Structures and Algorithms, 11(1):1–50, 1997

work page 1997

[40] [40]

Korte, L

B. Korte, L. Lovász, and R. Schrader.Greedoids, volume 4. Springer Science & Business Media, 2012

work page 2012

[41] [41]

Lijoi, R

A. Lijoi, R. H. Mena, and I. Prünster. Bayesian nonparametric estimation of the probability of discovering new species.Biometrika, 94(4):769–786, 2007

work page 2007

[42] [42]

S.-H. Lo. From the species problem to a general coverage problem via a new interpretation. The Annals of Statistics, 20(2):1094–1109, 1992

work page 1992

[43] [43]

A. Maurer. Concentration of the missing mass in metric spaces. arXiv preprint arXiv:2206.02012, 2022

work page arXiv 2022

[44] [44]

R. D. Millán, N. Sukhorukova, and J. Ugon. Application and issues in abstract convexity. arXiv preprint arXiv:2202.09959, 2022

work page arXiv 2022

[45] [45]

Quast, E

C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies, and F. O. Glöckner. The silva ribosomal rna gene database project: improved data processing and web-based tools. Nucleic acids research, 41(D1):D590–D596, 2012

work page 2012

[46] [46]

H. E. Robbins. Estimating the total probability of the unobserved outcomes of an experiment. Ann. Math. Statist., 39(6):256–257, 1968

work page 1968

[47] [47]

W. H. Rogers and T. J. Wagner. A finite sample distribution-free performance bound for local discrimination rules.Annals of Statistics, pages 506–514, 1978

work page 1978

[48] [48]

H. L. Sanders. Marine benthic diversity: a comparative study.The American Naturalist, 102(925):243–282, 1968

work page 1968

[49] [49]

J. Shao. Linear model selection by cross-validation.Journal of the American statistical Association, 88(422):486–494, 1993

work page 1993

[50] [50]

A. F. Siegel and R. Z. German. Rarefaction and taxonomic diversity.Biometrics, pages 235–241, 1982

work page 1982

[51] [51]

Simberloff

D. Simberloff. Properties of the rarefaction diversity measurement. The American Naturalist, 106(949):414–418, 1972. 48

work page 1972

[52] [52]

Sinclair and M

A. Sinclair and M. Jerrum. Approximate counting, uniform generation and rapidly mixing Markov chains.Information and Computation, 82(1):93–133, 1989

work page 1989

[53] [53]

J. E. Spencer and A. Largey. Geary on inference in multiple regregression and on the taxi problem. Economic and Social Review, 24(3):275–294, 1993

work page 1993

[54] [54]

R. P. Stanley. Enumerative combinatorics volume 1 second edition.Cambridge studies in advanced mathematics, 2011

work page 2011

[55] [55]

M. Stone. Cross-validation: A review.Statistics: A Journal of Theoretical and Applied Statistics, 9(1):127–139, 1978

work page 1978

[56] [56]

W. T. Trotter.Combinatorics and Partially Ordered Sets: Dimension Theory. Johns Hopkins University Press, 1992

work page 1992

[57] [57]

L. G. Valiant. The complexity of computing the permanent.Theoretical Computer Science, 8(2):189–201, 1979

work page 1979

[58] [58]

M. L. van De Vel.Theory of convex structures, volume 50. Elsevier, 1993

work page 1993

[59] [59]

Convex volume approximation.https://en.wikipedia.org/ wiki/Convex_volume_approximation, 2025

Wikipedia contributors. Convex volume approximation.https://en.wikipedia.org/ wiki/Convex_volume_approximation, 2025. Accessed: 26 June 2025

work page 2025

[60] [60]

German tank problem

Wikipedia contributors. German tank problem. https://en.wikipedia.org/wiki/ German_tank_problem, 2025. Accessed: 19 June 2025

work page 2025

[61] [61]

Y. Yang. Consistency of cross validation for comparing regression procedures.The Annals of Statistics, pages 2450–2473, 2007. 49

work page 2007