pith. sign in

arxiv: 2508.05901 · v4 · submitted 2025-08-07 · 🧮 math.ST · cs.IT· math.IT· math.PR· stat.ML· stat.TH

Estimating the size of a set using cascading exclusion

Pith reviewed 2026-05-18 23:57 UTC · model grok-4.3

classification 🧮 math.ST cs.ITmath.ITmath.PRstat.MLstat.TH
keywords set size estimationcascading exclusionnon-asymptotic boundsunseen species problemconvex volume estimationbirthday problemfinite sample theorypopulation testing
0
0 comments X

The pith

Cascading exclusion interpolates between birthday repeats and maximum estimators to bound the size of a finite set from samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a general non-asymptotic theory for estimating the cardinality of an unknown finite set S from an i.i.d. uniform sample of size n. It introduces refinements that bridge the classic birthday-problem regime, which needs samples of order sqrt(|S|), and the efficient maximum-based estimator available when S is ordered like the integers. The resulting bounds apply directly to estimating the volume of a compact convex set, to the unseen species problem, and to testing whether a new observation is typical from a large prespecified population. The same framework supplies non-parametric finite-n error bounds for regression-style predictors.

Core claim

The authors show that a cascading-exclusion construction produces a general theorem giving non-parametric finite n error bounds for estimating |S|; the bounds interpolate between the birthday-problem setting and the ordered-set maximum estimator and extend to volume estimation, unseen-species counts, and a family of population-testing problems.

What carries the argument

Cascading exclusion, a sequential process that excludes observed elements to tighten bounds on the size of the remaining unobserved portion of the set.

If this is right

  • The volume of a compact convex set can be estimated with explicit non-asymptotic error bounds.
  • The unseen species problem receives finite-n error bounds.
  • Testing whether a new observation is a typical pick from a large prespecified population is supported by the same bounds.
  • Regression-style predictors obtain non-parametric finite-n error bounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The non-asymptotic bounds could be used to choose minimal sample sizes in advance for reliable estimation.
  • The construction may extend to related support-size problems in high-dimensional statistics.

Load-bearing premise

The observations form an i.i.d. uniform sample from the unknown finite set S.

What would settle it

Repeated draws from a set of known size where the cascading-exclusion error bounds are violated more frequently than the theorem guarantees.

Figures

Figures reproduced from arXiv: 2508.05901 by Persi Diaconis, Sourav Chatterjee, Susan Holmes.

Figure 1
Figure 1. Figure 1: Convex hull of 100 randomly generated points uniformly distributed in the rectangle [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Verification of the MSE bound from Theorem [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Verification of the second bound from Theorem [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustrations of the volume estimator vol( \K) from corollary 2.5. The ratio E[vol( \K)/vol(K)] approaches 1 (unbiased estimation) as sample size increases for unit cubes, unit balls, triangles, and tetrahedra in two and three dimensions. This convergence supports the validity of the volume estimation approach underlying the corollary’s error bound. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Three examples of partially ordered sets. A line connecting two vertices indicates [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Illustration of subforest sampling. (a) A finite convex subset [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of Within-Sample and Sample-to-Population Distance Distributions. (a) [PITH_FULL_IMAGE:figures/full_fig_p025_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of Anderson–Darling statistics. (a) The AD statistic for the leave [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Distribution of nearest neighbor distances in the SILVA data. Top: Within-sample [PITH_FULL_IMAGE:figures/full_fig_p027_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: MSE of exclusion coverage probabilities between comparing the LOO estimate of [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗
read the original abstract

Let $S$ be a finite set, and $X_1,\ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$, without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of the order $|S|^\frac{1}{2}$. On the other hand, if $S=\{1,2,\ldots,|S|\}$, the maximum of the sample blown up by $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed. This includes estimating the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question `Is this new observation a typical pick from a large prespecified population?' We also treat regression style predictors. A general theorem gives non-parametric finite $n$ error bounds in all cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper develops a cascading exclusion estimator for the cardinality |S| of an unknown finite set S, based on an i.i.d. uniform sample X_1,...,X_n from S. It interpolates between the classical birthday-problem regime (requiring n ~ |S|^{1/2}) and the max-based estimator available when S is ordered, and supplies a general non-asymptotic theory with explicit finite-n error bounds. The framework is applied to volume estimation of compact convex sets, the unseen-species problem, typicality testing, and regression-style predictors.

Significance. If the central non-asymptotic bounds and their derivations hold, the work supplies a unified, parameter-light approach to cardinality estimation that bridges two classical extremes and yields concrete guarantees for several standard statistical tasks. The explicit treatment of multiple applications and the absence of fitted parameters are strengths.

major comments (2)
  1. [§4] §4 (general theorem): the non-asymptotic error bound is stated to be 'parameter-free' once the cascading-exclusion probabilities are fixed, yet the proof sketch appears to retain an implicit dependence on the unknown partition structure of S that is not reduced to observable quantities; this needs to be clarified or removed for the claim to be load-bearing.
  2. [Application to convex-volume estimation] Application to convex-volume estimation (around Eq. (12)): the reduction from the general theorem to the volume estimator invokes an additional discretization step whose error is bounded only asymptotically; a fully non-asymptotic version of this step is required to support the paper's uniform finite-n guarantee.
minor comments (3)
  1. [Abstract] The abstract lists 'a host of testing problems' without naming them; a short enumerated list would improve readability.
  2. [§2] Notation for the cascading-exclusion indicator sequence is introduced without a dedicated display equation; adding one would help readers track the subsequent probability calculations.
  3. [Introduction] Several references to classical birthday-problem bounds are given only by name; adding the precise citations (e.g., to the original Erdős–Rényi or modern non-asymptotic versions) would strengthen the literature placement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (general theorem): the non-asymptotic error bound is stated to be 'parameter-free' once the cascading-exclusion probabilities are fixed, yet the proof sketch appears to retain an implicit dependence on the unknown partition structure of S that is not reduced to observable quantities; this needs to be clarified or removed for the claim to be load-bearing.

    Authors: We appreciate the referee's careful scrutiny of the proof in §4. The error bound is expressed exclusively in terms of the cascading-exclusion probabilities, which are fixed once computed from the sample and do not require separate knowledge of the partition structure of S. The partition influences the probabilities but enters the final bound only through them; no additional implicit dependence remains. To remove any ambiguity, we will revise the theorem statement and expand the proof sketch to explicitly trace the reduction to these observable quantities, confirming that the bound is parameter-free conditional on the probabilities. revision: yes

  2. Referee: [Application to convex-volume estimation] Application to convex-volume estimation (around Eq. (12)): the reduction from the general theorem to the volume estimator invokes an additional discretization step whose error is bounded only asymptotically; a fully non-asymptotic version of this step is required to support the paper's uniform finite-n guarantee.

    Authors: We agree that the current treatment of the discretization step in the convex-volume application relies on asymptotic error control. To maintain the uniform finite-n guarantees claimed throughout the paper, we will derive and insert an explicit non-asymptotic bound for the discretization error. This revision will ensure the volume estimator inherits fully non-asymptotic guarantees from the general theorem. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained from explicit i.i.d. sampling model

full rationale

The paper states its core modeling premise explicitly (i.i.d. uniform draws from unknown finite set S) and derives the cascading-exclusion probabilities, non-asymptotic error bounds, and applications (volume estimation, unseen species, regression predictors) directly from that premise via standard probabilistic arguments. No step reduces a claimed prediction or first-principles result to a fitted parameter, self-referential definition, or load-bearing self-citation chain; the central theorem and its proof remain independent of the target quantities once the sampling model is granted. This is the normal case of a self-contained statistical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard domain assumption of i.i.d. uniform sampling from a finite set; no free parameters, invented entities, or additional axioms are indicated in the abstract.

axioms (1)
  • domain assumption X_1,...,X_n are i.i.d. uniform samples from finite set S
    This is the explicit modeling premise stated at the opening of the abstract and required for all subsequent estimators and bounds.

pith-pipeline@v0.9.0 · 5714 in / 1213 out tokens · 41987 ms · 2026-05-18T23:57:56.988827+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 1 internal anchor

  1. [1]

    A. N. Angelopoulos, R. F. Barber, and S. Bates. Theoretical Foundations of Conformal Prediction. arXiv preprint arXiv:2411.11824, 2024

  2. [2]

    Austern and W

    M. Austern and W. Zhou. Asymptotics of cross-validation. arXiv preprint arXiv:2001.11111, 2020

  3. [3]

    E. Badr, M. EL-Hakeem, E. E. El-Sharawy, and T. E. Ahmed. An efficient algorithm for decomposition of partially ordered sets.Journal of Mathematics, 2023(1):9920700, 2023

  4. [4]

    Baldin and M

    N. Baldin and M. Reiß. Unbiased estimation of the volume of a convex body.Stochastic Processes and their Applications, 126(12):3716–3732, 2016

  5. [5]

    Bunge and M

    J. Bunge and M. Fitzpatrick. Estimating the number of species: A review.Journal of the American Statistical Association, 88(421):364–373, 1993

  6. [6]

    B. J. Callahan, P. J. McMurdie, M. J. Rosen, A. W. Han, A. J. A. Johnson, and S. P. Holmes. Dada2: High-resolution sample inference from illumina amplicon data.Nature methods, 13(7):581–583, 2016. 45

  7. [7]

    A. Chao. Nonparametric estimation of the number of classes in a population.Scandinavian Journal of statistics, pages 265–270, 1984

  8. [8]

    Chao and S.-M

    A. Chao and S.-M. Lee. Estimating the number of classes via sample coverage.Journal of the American statistical Association, 87(417):210–217, 1992

  9. [9]

    Y. Chen, P. Diaconis, S. P. Holmes, and J. S. Liu. Sequential Monte Carlo methods for statistical analysis of tables.Journal of the American Statistical Association, 100(469): 109–120, 2005

  10. [10]

    A. S. Corbet and R. A. Fisher. The butterfly species of a sample from malaya.Journal of Animal Ecology, 12(1):27–37, 1943

  11. [11]

    Darroch and D

    J. Darroch and D. Ratcliff. A note on capture-recapture estimation.Biometrics, pages 149–153, 1980

  12. [12]

    Devroye.Nonparametric Discrimination and Density Estimation

    L. Devroye.Nonparametric Discrimination and Density Estimation. PhD thesis, Univer- sity of Texas at Austin, 1976. URLhttps://apps.dtic.mil/sti/pdfs/ADA032738.pdf

  13. [13]

    Devroye, L

    L. Devroye, L. Györfi, and G. Lugosi.A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013

  14. [14]

    Diaconis and H

    P. Diaconis and H. Anderson. Hit and run as a unifying device.J. Soc. Francaise Statist, 148:5–28, 2007

  15. [15]

    Diaconis and B

    P. Diaconis and B. Efron. Testing for independence in a two-way table: new interpretations of the chi-square statistic.The Annals of Statistics, pages 845–874, 1985

  16. [16]

    Diaconis and S

    P. Diaconis and S. Holmes. Three examples of Monte -Carlo Markov chains: at the interface between statistical computing, computer science, and statistical mechanics. In Discrete probability and algorithms, pages 43–56. Springer, 1995

  17. [17]

    Diaconis and M

    P. Diaconis and M. Howes. Random sampling of partitions and contingency tables: Two practical examples of the Burnside process.arXiv preprint arXiv:2503.02818, 2025

  18. [18]

    Diaconis and C

    P. Diaconis and C. Stein. Decision theorey, 1981. unpublished chapter

  19. [19]

    Diaconis and B

    P. Diaconis and B. Sturmfels. Algebraic algorithms for sampling from conditional distributions. The Annals of statistics, 26(1):363–397, 1998

  20. [20]

    Diaconis and C

    P. Diaconis and C. Zhong. Hahn polynomials and the Burnside process.The Ramanujan Journal, pages 1–29, 2020. doi: 10.1007/s11139-021-00482-z. URLhttps://doi.org/ 10.1007/s11139-021-00482-z

  21. [21]

    Diaconis and C

    P. Diaconis and C. Zhong. Counting the number of group orbits by marrying the burnside process with importance sampling.arXiv preprint arXiv:2501.11731, 2025. 46

  22. [22]

    Diaconis, S

    P. Diaconis, S. Holmes, and M. Shahshahani. Sampling from a manifold. InAdvances in modern statistical theory and applications: a Festschrift in honor of Morris L. Eaton, volume 10, pages 102–126. Institute of Mathematical Statistics, 2013

  23. [23]

    Dyer and A

    M. Dyer and A. Frieze. Computing the volume of convex bodies: A case where randomness provably helps. In Probabilistic Combinatorics and Its Applications, volume 44 of Proceedings of Symposia in Applied Mathematics, pages 123–170. American Mathematical Society, 1991

  24. [24]

    M. E. Dyer, A. M. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies.Journal of the ACM, 35(4):809–832, 1988

  25. [25]

    B. Efron. The convex hull of a random set of points.Biometrika, 52(3-4):331–343, 1965

  26. [26]

    Efron and R

    B. Efron and R. Thisted. Estimating the Number of Unseen Species: How Many Words Did Shakespeare Know?Biometrika, 63(3):435–447, 1976

  27. [27]

    Efron and R

    B. Efron and R. Tibshirani. Statistical data analysis in the computer age.Science, 253 (5018):390–395, 1991

  28. [28]

    Favaro, B

    S. Favaro, B. Nipoti, and Y. W. Teh. Rediscovery of good-turing estimators via bayesian nonparametrics. Biometrics, 72(1):136–145, 2016

  29. [29]

    I. J. Good. Turing’s anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma.Journal of Statistical Computation and Simulation, 66(2):101–111, 2000

  30. [30]

    N. J. Gotelli and R. K. Colwell. Estimating species richness. In A. E. Magurran and B. J. McGill, editors,Biological Diversity: Frontiers in Measurement and Assessment, pages 39–54. Oxford University Press, Oxford, UK, 2010

  31. [31]

    Harper and S

    L. Harper and S. Bezrukov. Monte Carlo estimation of the number of ideals in a poset. Technical report, Dept of Mathematics, UC Riverside, CA, 2008

  32. [32]

    S. H. Holmes and W. Huber.Modern statistics for modern biology. Cambridge university press, 2018

  33. [33]

    Homrighausen and D

    D. Homrighausen and D. J. McDonald. Leave-one-out cross-validation is risk consistent for lasso. Machine learning, 97(1):65–78, 2014

  34. [34]

    S. H. Hurlbert. The nonconcept of species diversity: a critique and alternative parameters. Ecology, 52(4):577–586, 1971

  35. [35]

    Hwang, R

    W.-H. Hwang, R. Huggins, and L.-F. Chen. A note on the inverse birthday problem with applications. The American Statistician, 71(3):191–201, 2017

  36. [36]

    Jaccard.Lois de Distribution Florale dans la Zone Alpine, volume 38

    P. Jaccard.Lois de Distribution Florale dans la Zone Alpine, volume 38. Bulletin de la Société Vaudoise des Sciences Naturelles, 1902. 47

  37. [37]

    Jeffreys.The theory of probability

    H. Jeffreys.The theory of probability. Oxford University Press, 1939

  38. [38]

    M. Jerrum. Uniform sampling modulo a group of symmetries using Markov chain simula- tion. In J. Friedman, editor,Expanding Graphs, volume 10 ofDIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 37–47. American Mathematical Society, 1993. URLhttps://www.lfcs.inf.ed.ac.uk/reports/93/ECS-LFCS-93-272/ ECS-LFCS-93-272.ps

  39. [39]

    Kannan, L

    R. Kannan, L. Lovász, and M. Simonovits. Random walks and ano∗(n5) volume algorithm for convex bodies.Random Structures and Algorithms, 11(1):1–50, 1997

  40. [40]

    Korte, L

    B. Korte, L. Lovász, and R. Schrader.Greedoids, volume 4. Springer Science & Business Media, 2012

  41. [41]

    Lijoi, R

    A. Lijoi, R. H. Mena, and I. Prünster. Bayesian nonparametric estimation of the probability of discovering new species.Biometrika, 94(4):769–786, 2007

  42. [42]

    S.-H. Lo. From the species problem to a general coverage problem via a new interpretation. The Annals of Statistics, 20(2):1094–1109, 1992

  43. [43]

    A. Maurer. Concentration of the missing mass in metric spaces. arXiv preprint arXiv:2206.02012, 2022

  44. [44]

    R. D. Millán, N. Sukhorukova, and J. Ugon. Application and issues in abstract convexity. arXiv preprint arXiv:2202.09959, 2022

  45. [45]

    Quast, E

    C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies, and F. O. Glöckner. The silva ribosomal rna gene database project: improved data processing and web-based tools. Nucleic acids research, 41(D1):D590–D596, 2012

  46. [46]

    H. E. Robbins. Estimating the total probability of the unobserved outcomes of an experiment. Ann. Math. Statist., 39(6):256–257, 1968

  47. [47]

    W. H. Rogers and T. J. Wagner. A finite sample distribution-free performance bound for local discrimination rules.Annals of Statistics, pages 506–514, 1978

  48. [48]

    H. L. Sanders. Marine benthic diversity: a comparative study.The American Naturalist, 102(925):243–282, 1968

  49. [49]

    J. Shao. Linear model selection by cross-validation.Journal of the American statistical Association, 88(422):486–494, 1993

  50. [50]

    A. F. Siegel and R. Z. German. Rarefaction and taxonomic diversity.Biometrics, pages 235–241, 1982

  51. [51]

    Simberloff

    D. Simberloff. Properties of the rarefaction diversity measurement. The American Naturalist, 106(949):414–418, 1972. 48

  52. [52]

    Sinclair and M

    A. Sinclair and M. Jerrum. Approximate counting, uniform generation and rapidly mixing Markov chains.Information and Computation, 82(1):93–133, 1989

  53. [53]

    J. E. Spencer and A. Largey. Geary on inference in multiple regregression and on the taxi problem. Economic and Social Review, 24(3):275–294, 1993

  54. [54]

    R. P. Stanley. Enumerative combinatorics volume 1 second edition.Cambridge studies in advanced mathematics, 2011

  55. [55]

    M. Stone. Cross-validation: A review.Statistics: A Journal of Theoretical and Applied Statistics, 9(1):127–139, 1978

  56. [56]

    W. T. Trotter.Combinatorics and Partially Ordered Sets: Dimension Theory. Johns Hopkins University Press, 1992

  57. [57]

    L. G. Valiant. The complexity of computing the permanent.Theoretical Computer Science, 8(2):189–201, 1979

  58. [58]

    M. L. van De Vel.Theory of convex structures, volume 50. Elsevier, 1993

  59. [59]

    Convex volume approximation.https://en.wikipedia.org/ wiki/Convex_volume_approximation, 2025

    Wikipedia contributors. Convex volume approximation.https://en.wikipedia.org/ wiki/Convex_volume_approximation, 2025. Accessed: 26 June 2025

  60. [60]

    German tank problem

    Wikipedia contributors. German tank problem. https://en.wikipedia.org/wiki/ German_tank_problem, 2025. Accessed: 19 June 2025

  61. [61]

    Y. Yang. Consistency of cross validation for comparing regression procedures.The Annals of Statistics, pages 2450–2473, 2007. 49