Estimating the size of a set using cascading exclusion
Pith reviewed 2026-05-18 23:57 UTC · model grok-4.3
The pith
Cascading exclusion interpolates between birthday repeats and maximum estimators to bound the size of a finite set from samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that a cascading-exclusion construction produces a general theorem giving non-parametric finite n error bounds for estimating |S|; the bounds interpolate between the birthday-problem setting and the ordered-set maximum estimator and extend to volume estimation, unseen-species counts, and a family of population-testing problems.
What carries the argument
Cascading exclusion, a sequential process that excludes observed elements to tighten bounds on the size of the remaining unobserved portion of the set.
If this is right
- The volume of a compact convex set can be estimated with explicit non-asymptotic error bounds.
- The unseen species problem receives finite-n error bounds.
- Testing whether a new observation is a typical pick from a large prespecified population is supported by the same bounds.
- Regression-style predictors obtain non-parametric finite-n error bounds.
Where Pith is reading between the lines
- The non-asymptotic bounds could be used to choose minimal sample sizes in advance for reliable estimation.
- The construction may extend to related support-size problems in high-dimensional statistics.
Load-bearing premise
The observations form an i.i.d. uniform sample from the unknown finite set S.
What would settle it
Repeated draws from a set of known size where the cascading-exclusion error bounds are violated more frequently than the theorem guarantees.
Figures
read the original abstract
Let $S$ be a finite set, and $X_1,\ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$, without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of the order $|S|^\frac{1}{2}$. On the other hand, if $S=\{1,2,\ldots,|S|\}$, the maximum of the sample blown up by $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed. This includes estimating the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question `Is this new observation a typical pick from a large prespecified population?' We also treat regression style predictors. A general theorem gives non-parametric finite $n$ error bounds in all cases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a cascading exclusion estimator for the cardinality |S| of an unknown finite set S, based on an i.i.d. uniform sample X_1,...,X_n from S. It interpolates between the classical birthday-problem regime (requiring n ~ |S|^{1/2}) and the max-based estimator available when S is ordered, and supplies a general non-asymptotic theory with explicit finite-n error bounds. The framework is applied to volume estimation of compact convex sets, the unseen-species problem, typicality testing, and regression-style predictors.
Significance. If the central non-asymptotic bounds and their derivations hold, the work supplies a unified, parameter-light approach to cardinality estimation that bridges two classical extremes and yields concrete guarantees for several standard statistical tasks. The explicit treatment of multiple applications and the absence of fitted parameters are strengths.
major comments (2)
- [§4] §4 (general theorem): the non-asymptotic error bound is stated to be 'parameter-free' once the cascading-exclusion probabilities are fixed, yet the proof sketch appears to retain an implicit dependence on the unknown partition structure of S that is not reduced to observable quantities; this needs to be clarified or removed for the claim to be load-bearing.
- [Application to convex-volume estimation] Application to convex-volume estimation (around Eq. (12)): the reduction from the general theorem to the volume estimator invokes an additional discretization step whose error is bounded only asymptotically; a fully non-asymptotic version of this step is required to support the paper's uniform finite-n guarantee.
minor comments (3)
- [Abstract] The abstract lists 'a host of testing problems' without naming them; a short enumerated list would improve readability.
- [§2] Notation for the cascading-exclusion indicator sequence is introduced without a dedicated display equation; adding one would help readers track the subsequent probability calculations.
- [Introduction] Several references to classical birthday-problem bounds are given only by name; adding the precise citations (e.g., to the original Erdős–Rényi or modern non-asymptotic versions) would strengthen the literature placement.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [§4] §4 (general theorem): the non-asymptotic error bound is stated to be 'parameter-free' once the cascading-exclusion probabilities are fixed, yet the proof sketch appears to retain an implicit dependence on the unknown partition structure of S that is not reduced to observable quantities; this needs to be clarified or removed for the claim to be load-bearing.
Authors: We appreciate the referee's careful scrutiny of the proof in §4. The error bound is expressed exclusively in terms of the cascading-exclusion probabilities, which are fixed once computed from the sample and do not require separate knowledge of the partition structure of S. The partition influences the probabilities but enters the final bound only through them; no additional implicit dependence remains. To remove any ambiguity, we will revise the theorem statement and expand the proof sketch to explicitly trace the reduction to these observable quantities, confirming that the bound is parameter-free conditional on the probabilities. revision: yes
-
Referee: [Application to convex-volume estimation] Application to convex-volume estimation (around Eq. (12)): the reduction from the general theorem to the volume estimator invokes an additional discretization step whose error is bounded only asymptotically; a fully non-asymptotic version of this step is required to support the paper's uniform finite-n guarantee.
Authors: We agree that the current treatment of the discretization step in the convex-volume application relies on asymptotic error control. To maintain the uniform finite-n guarantees claimed throughout the paper, we will derive and insert an explicit non-asymptotic bound for the discretization error. This revision will ensure the volume estimator inherits fully non-asymptotic guarantees from the general theorem. revision: yes
Circularity Check
Derivation self-contained from explicit i.i.d. sampling model
full rationale
The paper states its core modeling premise explicitly (i.i.d. uniform draws from unknown finite set S) and derives the cascading-exclusion probabilities, non-asymptotic error bounds, and applications (volume estimation, unseen species, regression predictors) directly from that premise via standard probabilistic arguments. No step reduces a claimed prediction or first-principles result to a fitted parameter, self-referential definition, or load-bearing self-citation chain; the central theorem and its proof remain independent of the target quantities once the sampling model is granted. This is the normal case of a self-contained statistical derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption X_1,...,X_n are i.i.d. uniform samples from finite set S
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Theorem 1.1 ... E[(μ(A) − (1/n) Σ 1{Xi ∈ A'})²] ≤ 4δ' + 4(n−1)δ''/n + 2θ/n
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
cascading exclusion ... leave-one-out sums ... stability between A, A', A''
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A. N. Angelopoulos, R. F. Barber, and S. Bates. Theoretical Foundations of Conformal Prediction. arXiv preprint arXiv:2411.11824, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
M. Austern and W. Zhou. Asymptotics of cross-validation. arXiv preprint arXiv:2001.11111, 2020
-
[3]
E. Badr, M. EL-Hakeem, E. E. El-Sharawy, and T. E. Ahmed. An efficient algorithm for decomposition of partially ordered sets.Journal of Mathematics, 2023(1):9920700, 2023
work page 2023
-
[4]
N. Baldin and M. Reiß. Unbiased estimation of the volume of a convex body.Stochastic Processes and their Applications, 126(12):3716–3732, 2016
work page 2016
-
[5]
J. Bunge and M. Fitzpatrick. Estimating the number of species: A review.Journal of the American Statistical Association, 88(421):364–373, 1993
work page 1993
-
[6]
B. J. Callahan, P. J. McMurdie, M. J. Rosen, A. W. Han, A. J. A. Johnson, and S. P. Holmes. Dada2: High-resolution sample inference from illumina amplicon data.Nature methods, 13(7):581–583, 2016. 45
work page 2016
-
[7]
A. Chao. Nonparametric estimation of the number of classes in a population.Scandinavian Journal of statistics, pages 265–270, 1984
work page 1984
-
[8]
A. Chao and S.-M. Lee. Estimating the number of classes via sample coverage.Journal of the American statistical Association, 87(417):210–217, 1992
work page 1992
-
[9]
Y. Chen, P. Diaconis, S. P. Holmes, and J. S. Liu. Sequential Monte Carlo methods for statistical analysis of tables.Journal of the American Statistical Association, 100(469): 109–120, 2005
work page 2005
-
[10]
A. S. Corbet and R. A. Fisher. The butterfly species of a sample from malaya.Journal of Animal Ecology, 12(1):27–37, 1943
work page 1943
-
[11]
J. Darroch and D. Ratcliff. A note on capture-recapture estimation.Biometrics, pages 149–153, 1980
work page 1980
-
[12]
Devroye.Nonparametric Discrimination and Density Estimation
L. Devroye.Nonparametric Discrimination and Density Estimation. PhD thesis, Univer- sity of Texas at Austin, 1976. URLhttps://apps.dtic.mil/sti/pdfs/ADA032738.pdf
work page 1976
-
[13]
L. Devroye, L. Györfi, and G. Lugosi.A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013
work page 2013
-
[14]
P. Diaconis and H. Anderson. Hit and run as a unifying device.J. Soc. Francaise Statist, 148:5–28, 2007
work page 2007
-
[15]
P. Diaconis and B. Efron. Testing for independence in a two-way table: new interpretations of the chi-square statistic.The Annals of Statistics, pages 845–874, 1985
work page 1985
-
[16]
P. Diaconis and S. Holmes. Three examples of Monte -Carlo Markov chains: at the interface between statistical computing, computer science, and statistical mechanics. In Discrete probability and algorithms, pages 43–56. Springer, 1995
work page 1995
-
[17]
P. Diaconis and M. Howes. Random sampling of partitions and contingency tables: Two practical examples of the Burnside process.arXiv preprint arXiv:2503.02818, 2025
- [18]
-
[19]
P. Diaconis and B. Sturmfels. Algebraic algorithms for sampling from conditional distributions. The Annals of statistics, 26(1):363–397, 1998
work page 1998
-
[20]
P. Diaconis and C. Zhong. Hahn polynomials and the Burnside process.The Ramanujan Journal, pages 1–29, 2020. doi: 10.1007/s11139-021-00482-z. URLhttps://doi.org/ 10.1007/s11139-021-00482-z
-
[21]
P. Diaconis and C. Zhong. Counting the number of group orbits by marrying the burnside process with importance sampling.arXiv preprint arXiv:2501.11731, 2025. 46
-
[22]
P. Diaconis, S. Holmes, and M. Shahshahani. Sampling from a manifold. InAdvances in modern statistical theory and applications: a Festschrift in honor of Morris L. Eaton, volume 10, pages 102–126. Institute of Mathematical Statistics, 2013
work page 2013
-
[23]
M. Dyer and A. Frieze. Computing the volume of convex bodies: A case where randomness provably helps. In Probabilistic Combinatorics and Its Applications, volume 44 of Proceedings of Symposia in Applied Mathematics, pages 123–170. American Mathematical Society, 1991
work page 1991
-
[24]
M. E. Dyer, A. M. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies.Journal of the ACM, 35(4):809–832, 1988
work page 1988
-
[25]
B. Efron. The convex hull of a random set of points.Biometrika, 52(3-4):331–343, 1965
work page 1965
-
[26]
B. Efron and R. Thisted. Estimating the Number of Unseen Species: How Many Words Did Shakespeare Know?Biometrika, 63(3):435–447, 1976
work page 1976
-
[27]
B. Efron and R. Tibshirani. Statistical data analysis in the computer age.Science, 253 (5018):390–395, 1991
work page 1991
- [28]
-
[29]
I. J. Good. Turing’s anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma.Journal of Statistical Computation and Simulation, 66(2):101–111, 2000
work page 2000
-
[30]
N. J. Gotelli and R. K. Colwell. Estimating species richness. In A. E. Magurran and B. J. McGill, editors,Biological Diversity: Frontiers in Measurement and Assessment, pages 39–54. Oxford University Press, Oxford, UK, 2010
work page 2010
-
[31]
L. Harper and S. Bezrukov. Monte Carlo estimation of the number of ideals in a poset. Technical report, Dept of Mathematics, UC Riverside, CA, 2008
work page 2008
-
[32]
S. H. Holmes and W. Huber.Modern statistics for modern biology. Cambridge university press, 2018
work page 2018
-
[33]
D. Homrighausen and D. J. McDonald. Leave-one-out cross-validation is risk consistent for lasso. Machine learning, 97(1):65–78, 2014
work page 2014
-
[34]
S. H. Hurlbert. The nonconcept of species diversity: a critique and alternative parameters. Ecology, 52(4):577–586, 1971
work page 1971
- [35]
-
[36]
Jaccard.Lois de Distribution Florale dans la Zone Alpine, volume 38
P. Jaccard.Lois de Distribution Florale dans la Zone Alpine, volume 38. Bulletin de la Société Vaudoise des Sciences Naturelles, 1902. 47
work page 1902
-
[37]
Jeffreys.The theory of probability
H. Jeffreys.The theory of probability. Oxford University Press, 1939
work page 1939
-
[38]
M. Jerrum. Uniform sampling modulo a group of symmetries using Markov chain simula- tion. In J. Friedman, editor,Expanding Graphs, volume 10 ofDIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 37–47. American Mathematical Society, 1993. URLhttps://www.lfcs.inf.ed.ac.uk/reports/93/ECS-LFCS-93-272/ ECS-LFCS-93-272.ps
work page 1993
- [39]
- [40]
- [41]
-
[42]
S.-H. Lo. From the species problem to a general coverage problem via a new interpretation. The Annals of Statistics, 20(2):1094–1109, 1992
work page 1992
- [43]
- [44]
- [45]
-
[46]
H. E. Robbins. Estimating the total probability of the unobserved outcomes of an experiment. Ann. Math. Statist., 39(6):256–257, 1968
work page 1968
-
[47]
W. H. Rogers and T. J. Wagner. A finite sample distribution-free performance bound for local discrimination rules.Annals of Statistics, pages 506–514, 1978
work page 1978
-
[48]
H. L. Sanders. Marine benthic diversity: a comparative study.The American Naturalist, 102(925):243–282, 1968
work page 1968
-
[49]
J. Shao. Linear model selection by cross-validation.Journal of the American statistical Association, 88(422):486–494, 1993
work page 1993
-
[50]
A. F. Siegel and R. Z. German. Rarefaction and taxonomic diversity.Biometrics, pages 235–241, 1982
work page 1982
-
[51]
D. Simberloff. Properties of the rarefaction diversity measurement. The American Naturalist, 106(949):414–418, 1972. 48
work page 1972
-
[52]
A. Sinclair and M. Jerrum. Approximate counting, uniform generation and rapidly mixing Markov chains.Information and Computation, 82(1):93–133, 1989
work page 1989
-
[53]
J. E. Spencer and A. Largey. Geary on inference in multiple regregression and on the taxi problem. Economic and Social Review, 24(3):275–294, 1993
work page 1993
-
[54]
R. P. Stanley. Enumerative combinatorics volume 1 second edition.Cambridge studies in advanced mathematics, 2011
work page 2011
-
[55]
M. Stone. Cross-validation: A review.Statistics: A Journal of Theoretical and Applied Statistics, 9(1):127–139, 1978
work page 1978
-
[56]
W. T. Trotter.Combinatorics and Partially Ordered Sets: Dimension Theory. Johns Hopkins University Press, 1992
work page 1992
-
[57]
L. G. Valiant. The complexity of computing the permanent.Theoretical Computer Science, 8(2):189–201, 1979
work page 1979
-
[58]
M. L. van De Vel.Theory of convex structures, volume 50. Elsevier, 1993
work page 1993
-
[59]
Convex volume approximation.https://en.wikipedia.org/ wiki/Convex_volume_approximation, 2025
Wikipedia contributors. Convex volume approximation.https://en.wikipedia.org/ wiki/Convex_volume_approximation, 2025. Accessed: 26 June 2025
work page 2025
-
[60]
Wikipedia contributors. German tank problem. https://en.wikipedia.org/wiki/ German_tank_problem, 2025. Accessed: 19 June 2025
work page 2025
-
[61]
Y. Yang. Consistency of cross validation for comparing regression procedures.The Annals of Statistics, pages 2450–2473, 2007. 49
work page 2007
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.