arXiv: 2502.04122 · v3 · submitted 2025-02-06 · 📊 stat.ME

Bayesian discovery of species in multiple areas

Pith reviewed 2026-05-23 03:54 UTC · model grok-4.3

classification 📊 stat.ME
keywords species sampling · Bayesian nonparametrics · heterogeneous populations · distinct species · shared species · predictive distributions · sample size determination · ecological statistics

The pith

Bayesian nonparametric priors on two heterogeneous areas yield exact distributions for observed and predicted counts of distinct and shared species.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for species sampling when observations come from two distinct areas rather than a single homogeneous population. It derives the full distributional theory that describes the numbers of distinct and shared species in any observed sample. The same theory supplies exact predictive distributions for the numbers of unseen distinct and shared species that would appear in one or more additional samples of arbitrary sizes. These predictions also support calculations that determine how large a future sample must be to detect a target number of species. Readers care because the approach handles realistic ecological settings where environments differ and moves beyond frequentist methods restricted to single-step forecasts.

Core claim

By modeling species abundances in each of the two areas with Bayesian nonparametric priors that preserve exchangeability within areas and induce a joint predictive structure across areas, the authors obtain the exact joint distribution of the counts of distinct species in each area and the count of species shared between areas for any finite observed sample. The same construction directly supplies the predictive distributions for the corresponding counts in future samples drawn from the same two areas, for any choice of future sample sizes.
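
To make the quantities in this claim concrete, the sketch below tabulates them for two toy samples: the number of distinct species in each area and the number of species seen in both. The function name `count_distinct_and_shared` and the species labels are illustrative assumptions, not taken from the paper. The code only computes the summary statistics; it does not implement the paper's distributional theory.

```python
from typing import Hashable, Sequence


def count_distinct_and_shared(
    area_1: Sequence[Hashable], area_2: Sequence[Hashable]
) -> dict[str, int]:
    """Tabulate distinct-species counts per area and the shared-species count."""
    species_1 = set(area_1)  # species observed in area 1
    species_2 = set(area_2)  # species observed in area 2
    return {
        "distinct_area_1": len(species_1),
        "distinct_area_2": len(species_2),
        "shared": len(species_1 & species_2),  # observed in both areas
    }


# Toy data with hypothetical species labels.
sample_1 = ["a", "a", "b", "c", "c", "c"]
sample_2 = ["b", "d", "d", "c"]
counts = count_distinct_and_shared(sample_1, sample_2)
print(counts)
```

Here each area shows three distinct species and two of them (`b` and `c`) are shared. These are the statistics whose joint distribution the paper is said to characterize.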

What carries the argument

The distributional theory for in-sample and out-of-sample counts of distinct and shared species induced by Bayesian nonparametric priors on the two-area species abundance measures.

If this is right

  • Exact predictive distributions become available for the number of unseen distinct species in each area and the number of shared unseen species between areas, for any future sample sizes.
  • The theory extends one-step-ahead frequentist estimators to arbitrary numbers of future observations and supplies full probability distributions rather than point estimates.
  • Sample-size calculations are possible for any target number of distinct or shared species to be detected.
  • In-sample analysis of any finite observed sample yields the joint distribution of distinct and shared species counts without approximation.
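
The two-area formulas are not reproduced on this page, so the following sketch illustrates the sample-size idea with a deliberately simpler stand-in: a single homogeneous population under a Dirichlet process prior with a known concentration parameter `theta`. In that model, after `n` observations the next draw is a new species with probability `theta / (theta + n)`, so the expected number of new species in `m` further draws is a finite sum. The function names, the use of an expected-value criterion, and the `max_m` cap are assumptions made for illustration; this is not the paper's method.

```python
from typing import Optional


def expected_new_species(theta: float, n: int, m: int) -> float:
    """Expected number of new species in m further draws after n observations,
    for a single population under a Dirichlet process with concentration theta."""
    return sum(theta / (theta + n + i) for i in range(m))


def min_additional_sample_size(
    theta: float, n: int, target: float, max_m: int = 1_000_000
) -> Optional[int]:
    """Smallest m whose expected number of new species reaches `target`.

    Returns None if the target is not reached within max_m draws.
    """
    total = 0.0
    for m in range(1, max_m + 1):
        # Probability that the m-th additional draw is a new species.
        total += theta / (theta + n + m - 1)
        if total >= target:
            return m
    return None


# Hypothetical inputs: theta = 1, one observation so far.
print(expected_new_species(1.0, 1, 2))
print(min_additional_sample_size(1.0, 1, 0.8))
```

A two-area analogue would replace the one-dimensional sum with the paper's predictive distributions for distinct and shared species, and could target a probability of detection rather than an expectation.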

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same construction could be applied to other two-population problems such as estimating unique terms shared between two text corpora.
  • If the priors can be extended while preserving the predictive structure, the method would apply to three or more areas.
  • The distributional results could be used to design stratified sampling schemes that account for known habitat differences between areas.

Load-bearing premise

The species abundances in the two areas are generated by Bayesian nonparametric priors that deliver the required within-area exchangeability and cross-area predictive structure for shared and distinct species.

What would settle it

Draw new samples from two areas with known heterogeneity, compute the observed numbers of new distinct and shared species, and check whether those numbers fall inside the probability intervals given by the model's predictive distributions.
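
A minimal sketch of this check follows, under stated assumptions. A species counts as "new" in an area if it appears in that area's follow-up sample but not in its observed sample, and as "new shared" if it appears in both follow-up samples and in neither observed sample. These conventions, the function names, and the interval endpoints are hypothetical; in practice the intervals would come from the model's predictive distributions, which are not reproduced here.

```python
from typing import Hashable, Sequence


def new_species_counts(
    obs_1: Sequence[Hashable],
    obs_2: Sequence[Hashable],
    new_1: Sequence[Hashable],
    new_2: Sequence[Hashable],
) -> dict[str, int]:
    """Count newly discovered distinct and shared species in follow-up samples."""
    seen_1, seen_2 = set(obs_1), set(obs_2)
    fresh_1 = set(new_1) - seen_1  # new to area 1
    fresh_2 = set(new_2) - seen_2  # new to area 2
    # Shared in the follow-up samples and unseen in either observed sample.
    new_shared = (set(new_1) & set(new_2)) - (seen_1 | seen_2)
    return {
        "new_distinct_area_1": len(fresh_1),
        "new_distinct_area_2": len(fresh_2),
        "new_shared": len(new_shared),
    }


def covered(
    counts: dict[str, int], intervals: dict[str, tuple[int, int]]
) -> dict[str, bool]:
    """Report whether each realized count lies inside its predictive interval."""
    return {k: intervals[k][0] <= v <= intervals[k][1] for k, v in counts.items()}


# Toy data; the intervals are placeholders, not model output.
realized = new_species_counts(["a", "b"], ["b", "c"], ["a", "d", "e"], ["d", "c", "f"])
intervals = {
    "new_distinct_area_1": (0, 3),
    "new_distinct_area_2": (0, 1),
    "new_shared": (0, 2),
}
result = covered(realized, intervals)
print(realized, result)
```

Repeating this over many independent follow-up samples and comparing the observed coverage rate with the nominal level would give an empirical calibration check.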

Figures

Figures reproduced from arXiv: 2502.04122 by Alessandro Colombi, Federico Camerlenghi, Lucia Paci, Raffaele Argiento.

Figure 1. Left panel: estimated probability of discovering a new shared species, evaluated …

Original abstract

In ecology, the description of species composition and biodiversity calls for statistical methods that involve estimating features of interest in unobserved samples based on an observed one. In the last decade, the Bayesian nonparametrics literature has thoroughly investigated the case where data arise from a homogeneous population. In this work, we propose a novel framework to address heterogeneous populations, specifically dealing with scenarios where data arise from two areas. This setting significantly increases the mathematical complexity of the problem and, as a consequence, it has received limited attention in the literature. While early approaches leverage computational methods, we provide a distributional theory for the in-sample analysis of any observed sample and enable out-of-sample prediction for the number of unseen distinct and shared species in additional samples of arbitrary sizes. The latter also extends the frequentist estimators, which solely deal with one-step-ahead prediction. Furthermore, our results can be applied to address sample size determination in sampling problems aimed at detecting distinct and shared species. Our results are illustrated in a real-world dataset concerning a population of ants in the city of Trieste.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript develops a Bayesian nonparametric framework for species sampling from two heterogeneous areas. It derives exact distributional results for in-sample analysis of observed samples and predictive formulas for the number of unseen distinct and shared species in out-of-sample draws of arbitrary size, extending one-step-ahead frequentist estimators; the results are also applied to sample-size determination and illustrated on an ant population dataset from Trieste.

Significance. If the derivations are correct, the work would be significant for filling a gap in the BNP species-sampling literature: it supplies closed-form distributional and predictive results for the heterogeneous two-area case (previously limited to homogeneous populations or purely computational methods), thereby enabling exact inference, arbitrary-horizon prediction, and design applications without simulation.

minor comments (2)
  1. [Abstract] The specific BNP priors (e.g., Dirichlet process, Pitman-Yor, or normalized completely random measures) that induce the required exchangeability are left implicit; stating them explicitly would clarify the modeling assumptions that enable the claimed predictive structure.
  2. [Abstract] The abstract states that 'distributional theory and predictions are provided,' yet the provided text contains no displayed equations, error bounds, or verification steps; ensure the full manuscript includes at least one representative derivation (e.g., the joint distribution of unseen species counts) with a clear statement of the exchangeability assumptions used.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading, positive summary, and recommendation of minor revision. No specific major comments were raised in the report, so we provide no point-by-point responses below. We will incorporate any minor editorial or typographical suggestions in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper derives exact distributional results and predictive formulas for in-sample and out-of-sample species counts under a two-area heterogeneous BNP model. These are presented as new mathematical extensions from the homogeneous case, not as quantities obtained by fitting parameters to data and then renaming them as predictions, nor by self-citation chains that reduce the central claim to prior unverified work by the same authors. The modeling assumptions (exchangeability and predictive structure induced by BNP priors) are fixed inputs, and the contribution consists of independent derivations from those assumptions, making the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; ledger entries are inferred at the level of standard modeling assumptions rather than specific fitted quantities or new entities.

axioms (1)
  • domain assumption: Species abundances in each area follow a Bayesian nonparametric prior permitting exchangeable sampling and predictive distributions for unseen species.
    Standard background assumption in Bayesian nonparametrics for species sampling problems, invoked to enable the claimed distributional theory.

pith-pipeline@v0.9.0 · 5710 in / 1124 out tokens · 48399 ms · 2026-05-23T03:54:20.927490+00:00 · methodology

