Bayesian discovery of species in multiple areas
Pith reviewed 2026-05-23 03:54 UTC · model grok-4.3
The pith
Bayesian nonparametric priors on two heterogeneous areas yield exact distributions for observed and predicted counts of distinct and shared species.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By modeling species abundances in each of the two areas with Bayesian nonparametric priors that preserve exchangeability within areas and induce a joint predictive structure across areas, the authors obtain the exact joint distribution of the counts of distinct species in each area and the count of species shared between areas for any finite observed sample. The same construction directly supplies the predictive distributions for the corresponding counts in future samples drawn from the same two areas, for any choice of future sample sizes.
What carries the argument
The distributional theory for in-sample and out-of-sample counts of distinct and shared species induced by Bayesian nonparametric priors on the two-area species abundance measures.
If this is right
- Exact predictive distributions become available for the number of unseen distinct species in each area and the number of shared unseen species between areas, for any future sample sizes.
- The theory extends one-step-ahead frequentist estimators to arbitrary numbers of future observations and supplies full probability distributions rather than point estimates.
- Sample-size calculations are possible for any target number of distinct or shared species to be detected.
- In-sample analysis of any finite observed sample yields the joint distribution of distinct and shared species counts without approximation.
Where Pith is reading between the lines
- The same construction could be applied to other two-population problems such as estimating unique terms shared between two text corpora.
- If the priors can be extended while preserving the predictive structure, the method would apply to three or more areas.
- The distributional results could be used to design stratified sampling schemes that account for known habitat differences between areas.
Load-bearing premise
The species abundances in the two areas are generated by Bayesian nonparametric priors that deliver the required within-area exchangeability and cross-area predictive structure for shared and distinct species.
What would settle it
Draw new samples from two areas with known heterogeneity, compute the observed numbers of new distinct and shared species, and check whether those numbers fall inside the probability intervals given by the model's predictive distributions.
Figures
read the original abstract
In ecology, the description of species composition and biodiversity calls for statistical methods that involve estimating features of interest in unobserved samples based on an observed one. In the last decade, the Bayesian nonparametrics literature has thoroughly investigated the case where data arise from a homogeneous population. In this work, we propose a novel framework to address heterogeneous populations, specifically dealing with scenarios where data arise from two areas. This setting significantly increases the mathematical complexity of the problem and, as a consequence, it has received limited attention in the literature. While early approaches leverage computational methods, we provide a distributional theory for the in-sample analysis of any observed sample and enable out-of-sample prediction for the number of unseen distinct and shared species in additional samples of arbitrary sizes. The latter also extends the frequentist estimators, which solely deal with one-step-ahead prediction. Furthermore, our results can be applied to address sample size determination in sampling problems aimed at detecting distinct and shared species. Our results are illustrated in a real-world dataset concerning a population of ants in the city of Trieste.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a Bayesian nonparametric framework for species sampling from two heterogeneous areas. It derives exact distributional results for in-sample analysis of observed samples and predictive formulas for the number of unseen distinct and shared species in out-of-sample draws of arbitrary size, extending one-step-ahead frequentist estimators; the results are also applied to sample-size determination and illustrated on an ant population dataset from Trieste.
Significance. If the derivations are correct, the work would be significant for filling a gap in the BNP species-sampling literature: it supplies closed-form distributional and predictive results for the heterogeneous two-area case (previously limited to homogeneous populations or purely computational methods), thereby enabling exact inference, arbitrary-horizon prediction, and design applications without simulation.
minor comments (2)
- [Abstract] Abstract: the specific BNP priors (e.g., Dirichlet process, Pitman-Yor, or normalized completely random measures) inducing the required exchangeability are left implicit; stating them explicitly would clarify the modeling assumptions that enable the claimed predictive structure.
- [Abstract] The abstract states that 'distributional theory and predictions are provided,' yet the provided text contains no displayed equations, error bounds, or verification steps; ensure the full manuscript includes at least one representative derivation (e.g., the joint distribution of unseen species counts) with a clear statement of the exchangeability assumptions used.
Simulated Author's Rebuttal
We thank the referee for the careful reading, positive summary, and recommendation of minor revision. No specific major comments were raised in the report, so we provide no point-by-point responses below. We will incorporate any minor editorial or typographical suggestions in the revised manuscript.
Circularity Check
No significant circularity
full rationale
The paper derives exact distributional results and predictive formulas for in-sample and out-of-sample species counts under a two-area heterogeneous BNP model. These are presented as new mathematical extensions from the homogeneous case, not as quantities obtained by fitting parameters to data and then renaming them as predictions, nor by self-citation chains that reduce the central claim to prior unverified work by the same authors. The modeling assumptions (exchangeability and predictive structure induced by BNP priors) are fixed inputs, and the contribution consists of independent derivations from those assumptions, making the chain self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Species abundances in each area follow a Bayesian nonparametric prior permitting exchangeable sampling and predictive distributions for unseen species.
Reference graph
Works this paper leans on
-
[1]
Abramowitz M, Stegun IA (1964) Handbook of mathematical functions with formulas, graphs, and mathematical tables, National Bureau of Standards Applied Mathematics Series, vol No. 55. U. S. Government Printing Office, Washington, DC, for sale by the Superintendent of Documents
work page 1964
-
[2]
Argiento R, De Iorio M (2022) Is infinity that far? A B ayesian nonparametric perspective of finite mixture models. Ann Statist 50(5):2641--2663
work page 2022
-
[3]
J Statist Plann Inference 166:14--23, special Issue on Bayesian Nonparametrics
Bacallado S, Favaro S, Trippa L (2015) Bayesian nonparametric inference for shared species richness in multiple populations. J Statist Plann Inference 166:14--23, special Issue on Bayesian Nonparametrics
work page 2015
-
[4]
Balocchi C, Camerlenghi F, Favaro S (2024) A bayesian nonparametric approach to species sampling problems with ordering. Bayesian Anal 1:1--26
work page 2024
-
[5]
Balocchi C, Favaro S, Naulet Z (2025) Bayesian nonparametric inference for “species-sampling” problems. Statist Sci (to appear.)
work page 2025
-
[6]
J Multivariate Anal 156:18--28
Camerlenghi F, Lijoi A, Prünster I (2017) Bayesian prediction with multiple-samples information. J Multivariate Anal 156:18--28
work page 2017
-
[7]
Camerlenghi F, Lijoi A, Orbanz P, Pr\" u nster I (2019) Distribution theory for hierarchical processes. Ann Statist 47(1):67--92
work page 2019
-
[8]
J Amer Statist Assoc 119(545):320--331
Camerlenghi F, Favaro S, Masoero L, Broderick T (2024) Scaled process priors for B ayesian nonparametric estimation of the unseen genetic variation. J Amer Statist Assoc 119(545):320--331
work page 2024
-
[9]
Chao A (1984) Nonparametric estimation of the number of classes in a population. Scand J Stat 11(4):265--270
work page 1984
-
[10]
Chao A, Yang MCK (1993) Stopping rules and estimation for recapture debugging with unequal failure rates. Biometrika 80(1):193--201
work page 1993
-
[11]
Chao A, Hwang WH, Chen YC, Kuo CY (2000) Estimating the number of shared species in two communities. Statist Sinica 10(1):227--246
work page 2000
-
[12]
Chao A, Chazdon RL, Colwell RK, Shen TJ (2006 a ) Abundance-based similarity indices and their estimation when there are unseen species in samples. Biometrics 62(2):361--371
work page 2006
-
[13]
Aust N Z J Stat 48(2):117--128
Chao A, Shen TJ, Hwang WH (2006 b ) Application of laplace's boundary-mode approximations to estimate species and shared species richness. Aust N Z J Stat 48(2):117--128
work page 2006
-
[14]
Species-richness prediction and diversity estimation with R
Chao A, Ma K, Hsieh T, Chiu CH, Chao MA (2016) Package SpadeR . Species-richness prediction and diversity estimation with R
work page 2016
-
[15]
Chao A, Chiu CH, Colwell RK, Magnago LFS, Chazdon RL, Gotelli NJ (2017) Deciphering the enigma of undetected species, phylogenetic, and functional diversity based on good-turing theory. Ecology 98(11):2914--2929
work page 2017
- [16]
-
[17]
Chuang C, Shen T, Hwang W (2015) Estimating the number of shared species by a jackknife procedure. Environ Ecol Stat 22:759--778
work page 2015
-
[18]
Colombi A, Argiento R, Camerlenghi F, Paci L (2024) Hierarchical Mixture of Finite Mixtures (with Discussion) . Bayesian Anal pp 1 -- 29
work page 2024
-
[19]
(2009) Biodiversity: concepts, patterns, and measurement
Colwell RK, et al. (2009) Biodiversity: concepts, patterns, and measurement. The Princeton guide to ecology 663:257--263
work page 2009
-
[20]
De Blasi P, Favaro S, Lijoi A, Mena RH, Prünster I, Ruggiero M (2015) Are G ibbs-type priors the most natural generalization of the D irichlet process? IEEE Trans Pattern Anal Mach Intell 37(2):212--229
work page 2015
-
[21]
Efron B, Thisted R (1976) Estimating the number of unsen species: How many words did shakespeare know? Biometrika 63(3):435--447
work page 1976
-
[22]
J R Stat Soc Ser B Stat Methodol 71(5):993--1008
Favaro S, Lijoi A, Mena RH, Prünster I (2009) Bayesian non-parametric inference for species variety with a two-parameter P oisson– D irichlet process prior. J R Stat Soc Ser B Stat Methodol 71(5):993--1008
work page 2009
-
[23]
Favaro S, Lijoi A, Prünster I (2012) A new estimator of the discovery probability. Biometrics 68(4):1188--1196
work page 2012
-
[24]
Ferguson TS (1973) A bayesian analysis of some nonparametric problems. Ann Statist 1(2):209--230
work page 1973
-
[25]
Journal of Animal Ecology 12(1):42--58
Fisher RA, Corbet AS, Williams CB (1943) The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology 12(1):42--58
work page 1943
-
[26]
PhD thesis, Bocconi University
Franzolini B (2022) On dependent processes in B ayesian nonparametrics: Theory, methods, and applications. PhD thesis, Bocconi University
work page 2022
-
[27]
Ghilotti L, Camerlenghi F, Rigon T (2024) Bayesian analysis of product feature allocation models. arXiv : 2408.15806
-
[28]
Electron Comm Probab 15:79 -- 88
Gnedin A (2010) A species sampling model with finitely many types. Electron Comm Probab 15:79 -- 88
work page 2010
-
[29]
Gnedin A, Pitman J (2006) Exchangeable G ibbs partitions and S tirling triangles. J Math Sci 138:5674--5685
work page 2006
-
[30]
Good IJ (1953) The population frequencies of species and the estimation of population parameters. Biometrika 40(3-4):237--264
work page 1953
-
[31]
Good IJ, Toulmin GH (1956) The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43(1-2):45--63
work page 1956
-
[32]
Gotelli NJ, Colwell RK (2001) Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecology Letters 4(4):379--391
work page 2001
-
[33]
Lijoi A, Mena RH, Pr\" u nster I (2007) Bayesian nonparametric estimation of the probability of discovering new species. Biometrika 94(4):769--786
work page 2007
-
[34]
J Amer Statist Assoc 99(468):1108--1118
Mao CX (2004) Predicting the conditional probability of discovering a new class. J Amer Statist Assoc 99(468):1108--1118
work page 2004
-
[35]
J Amer Statist Assoc 113(521):340--356
Miller JW, Harrison MT (2018) Mixture models with a prior on the number of components. J Amer Statist Assoc 113(521):340--356
work page 2018
-
[36]
Mem Fac Sci Kyushu Univ Ser E (Biol) pp 215--235
Morisita M (1959) Measuring of dispersion of individuals and analysis of the distributional patterns. Mem Fac Sci Kyushu Univ Ser E (Biol) pp 215--235
work page 1959
-
[37]
Statist Probab Lett 7(3):191--194
Nayak TK (1988) A note on estimating the number of errors in a system by recapture sampling. Statist Probab Lett 7(3):191--194
work page 1988
-
[38]
Oksanen J, Simpson GL, Blanchet FG, Kindt R, Legendre P, Minchin PR, O'Hara R, Solymos P, Stevens MHH, Szoecs E, Wagner H, Barbour M, Bedward M, Bolker B, Borcard D, Borman T, Carvalho G, Chirico M, De Caceres M, Durand S, Evangelista HBA, FitzJohn R, Friendly M, Furneaux B, Hannigan G, Hill MO, Lahti L, McGlinn D, Ouellette MH, Ribeiro Cunha E, Smith T, ...
work page 2024
-
[39]
Proc Natl Acad Sci USA 113(47):13283--13288
Orlitsky A, Suresh AT, Wu Y (2016) Optimal prediction of the number of unseen species. Proc Natl Acad Sci USA 113(47):13283--13288
work page 2016
-
[40]
J Agric Biol Environ Stat 14(4):452--468
Pan HY, Chao A, Foissner W (2009) A nonparametric lower bound for the number of species shared by multiple communities. J Agric Biol Environ Stat 14(4):452--468
work page 2009
-
[41]
Probab Theory Related Fields 102(2):145--158
Pitman J (1995) Exchangeable and partially exchangeable random partitions. Probab Theory Related Fields 102(2):145--158
work page 1995
-
[42]
Statistics, Probability and Game Theory Papers in honor of David Blackwell 30:245--267
Pitman J (1996) Some developments of the B lackwell- M acqueen urn scheme. Statistics, Probability and Game Theory Papers in honor of David Blackwell 30:245--267
work page 1996
-
[43]
Quintana FA, M \"u ller P, Jara A, MacEachern SN (2022) The dependent dirichlet process and related models. Statist Sci 37(1):24--41
work page 2022
-
[44]
J Amer Statist Assoc 74(367):661--667
Rasmussen SL, Starr N (1979) Optimal and adaptive stopping in the search for new species. J Amer Statist Assoc 74(367):661--667
work page 1979
- [45]
-
[46]
J Amer Statist Assoc 101(476):1566--1581
Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical D irichlet processes. J Amer Statist Assoc 101(476):1566--1581
work page 2006
-
[47]
J Statist Plann Inference 142(5):1031--1039
Yue JC, Clayton MK (2012) Sequential sampling in the search for new shared species. J Statist Plann Inference 142(5):1031--1039
work page 2012
-
[48]
Yue JC, Clayton MK, Hung CR (2022) Comparing nonparametric estimators for the number of shared species in two populations. Diversity 14(4)
work page 2022
-
[49]
Zara L, Tordoni E, Castro-Delgado S, Colla A, Maccherini S, Marignani M, Panepinto F, Trittoni M, Bacaro G (2021) Cross-taxon relationships in mediterranean urban ecosystem: A case study from the city of trieste. Ecol Indic 125:107538
work page 2021
-
[50]
J Amer Statist Assoc 118(544):2521--2532
Zito A, Rigon T, Ovaskainen O, Dunson DB (2023) Bayesian modeling of sequential discoveries. J Amer Statist Assoc 118(544):2521--2532
work page 2023
-
[51]
, " * write output.state after.block = add.period write newline
ENTRY address archive author booktitle chapter doi edition editor eid eprint howpublished institution journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all ...
-
[52]
" write newline "" before.all 'output.state := FUNCTION add.period duplicate empty 'skip "." * add.blank if FUNCTION if.digit duplicate "0" = swap duplicate "1" = swap duplicate "2" = swap duplicate "3" = swap duplicate "4" = swap duplicate "5" = swap duplicate "6" = swap duplicate "7" = swap duplicate "8" = swap "9" = or or or or or or or or or FUNCTION ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.