Clustering Craters on the Moon with Dysfunctional Families
Pith reviewed 2026-05-21 03:20 UTC · model grok-4.3
The pith
A dysfunctional family constraint added to the Chinese restaurant process combines multiple expert Moon crater lists while estimating clustering uncertainty.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present the dysfunctional family Chinese restaurant process (DFCRP), a modification of the standard CRP that adds a constraint reflecting dependencies among crater identifications made by the same expert. The resulting model yields posterior samples of cluster assignments, which quantify clustering uncertainty and address shortcomings of the earlier DBSCAN-based merging of expert lists.
What carries the argument
The dysfunctional family Chinese restaurant process (DFCRP), a CRP variant that restricts table assignments by crater identifier: a mark cannot be seated at a table (cluster) that already holds a mark from the same expert.
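As a rough illustration, the sketch below computes prior seating probabilities for one new mark. It reads the constraint as the rule quoted further down this page (probability zero when the cluster already contains a mark from the same identifier) and assumes the usual CRP weights (cluster size for an existing cluster, `alpha` for a new one) followed by renormalization. The function name `dfcrp_seating_probs` and its arguments are hypothetical, and the paper's exact weights may differ.

```python
from collections import Counter

def dfcrp_seating_probs(assignments, experts, new_expert, alpha=1.0):
    """Prior seating probabilities for one new mark (illustrative sketch).

    assignments: cluster label of each already-seated mark
    experts:     expert id of each already-seated mark
    new_expert:  expert id of the mark being seated
    alpha:       CRP concentration parameter (assumed standard CRP weighting)
    Returns a dict {cluster_label or "new": probability}.
    """
    sizes = Counter(assignments)
    # Clusters that already hold a mark from the same expert are forbidden.
    blocked = {c for c, e in zip(assignments, experts) if e == new_expert}
    weights = {c: float(n) for c, n in sizes.items() if c not in blocked}
    weights["new"] = alpha  # a new cluster is always allowed
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# Cluster 0 holds marks from experts A and B; cluster 1 holds one mark from B.
probs = dfcrp_seating_probs([0, 0, 1], ["A", "B", "B"], "A", alpha=1.0)
print(probs)
```

In this toy call, cluster 0 is excluded for expert A, so the remaining mass is split between cluster 1 and a new cluster.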
If this is right
- Posterior samples from the DFCRP allow downstream analyses such as crater size distributions to include clustering uncertainty.
- The provided Gibbs sampler and hyperparameter guidance make the method immediately usable on new expert-labeled image sets.
- Simulation results indicate that the DFCRP handles expert variability more flexibly than the unmodified CRP.
- Applied to the R14 data, the approach produces cluster assignments that can be compared directly with prior DBSCAN outputs.
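To make "posterior samples of cluster assignments" concrete, here is a minimal sketch of two summaries that such draws support: a pairwise co-clustering matrix and the posterior distribution of the number of clusters. The draws below are made-up toy labels, not output from the paper's sampler, and the function names are hypothetical.

```python
from collections import Counter

def coclustering_matrix(draws):
    """Fraction of posterior draws in which each pair of marks shares a cluster."""
    n = len(draws[0])
    counts = [[0] * n for _ in range(n)]
    for labels in draws:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(draws) for c in row] for row in counts]

def cluster_count_distribution(draws):
    """Posterior probability of each number of clusters."""
    counts = Counter(len(set(labels)) for labels in draws)
    return {k: v / len(draws) for k, v in sorted(counts.items())}

# Toy posterior draws: each inner list assigns a cluster label to 4 marks.
draws = [[0, 0, 1, 1], [0, 0, 1, 2], [0, 1, 2, 2], [0, 0, 1, 1]]
psm = coclustering_matrix(draws)
k_dist = cluster_count_distribution(draws)
print(psm[0][1], k_dist)
```

Entries of the matrix near 0 or 1 indicate pairs whose grouping is stable across draws; intermediate values flag marks whose cluster membership is uncertain.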
Where Pith is reading between the lines
- The same constraint idea could be tested on other multi-observer tasks such as galaxy or medical-image annotation where raters show systematic differences.
- Scientific conclusions about lunar impact chronology might shift once clustering uncertainty is propagated rather than ignored.
- The DFCRP structure suggests natural extensions to three-dimensional or time-varying identification problems in planetary science.
Load-bearing premise
The dysfunctional family constraint correctly encodes the statistical dependencies that arise when the same expert marks multiple craters on one image.
What would settle it
A controlled simulation with known true clusters where the DFCRP posterior intervals fail to cover the true assignments at the nominal rate would falsify the uncertainty estimation claim.
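A coverage check of this kind could be organized roughly as follows. The sketch assumes each simulated replicate provides a known true value of some scalar summary (for example, the number of clusters) together with posterior draws of that summary; the choice of summary, the simple order-statistic interval, and the function names are all assumptions for illustration rather than the paper's procedure.

```python
import math

def credible_interval(samples, level=0.95):
    """Equal-tailed interval from order statistics (conservative rounding)."""
    s = sorted(samples)
    tail = (1.0 - level) / 2.0
    lo = s[int(math.floor(tail * (len(s) - 1)))]
    hi = s[int(math.ceil((1.0 - tail) * (len(s) - 1)))]
    return lo, hi

def empirical_coverage(replicates, level=0.95):
    """replicates: list of (true_value, posterior_samples) pairs."""
    hits = 0
    for truth, samples in replicates:
        lo, hi = credible_interval(samples, level)
        hits += lo <= truth <= hi
    return hits / len(replicates)

# Toy replicates: true number of clusters and made-up posterior draws of it.
replicates = [
    (5, [4, 5, 5, 6, 5]),
    (3, [7, 8, 8, 9, 8]),
    (2, [2, 2, 2, 3, 2]),
]
coverage = empirical_coverage(replicates)
print(coverage)
```

With many replicates, an empirical coverage well below the nominal level would be the kind of evidence described above.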
Original abstract
Summaries of craters on terrestrial bodies, such as the number and size distribution, are essential for understanding the history of the Solar System. Identifying craters, however, has not been automated and thus relies on expert crater-counters marking static images. Robbins et al. (2014) (hereafter R14) showed that, contrary to previously held assumptions, there exists large variability across expert crater-counters' identified crater lists. How best to combine identified crater lists across multiple experts for the purposes of learning about the Solar System is an open and consequential question. R14 combined identified crater lists via clustering through a modification of the popular DBSCAN clustering method. Their approach did not, however, make use of all the constraining information available nor did it provide an estimate of clustering uncertainty. To address the shortcomings of the DBSCAN method, we present a novel clustering approach that can combine multiple lists of identified objects of interest from the same image. The key innovation is incorporating a dysfunctional family constraint into the Bayesian nonparametric clustering approach, the Chinese restaurant process (CRP), which naturally takes into account information about the crater identifier. The dysfunctional family Chinese restaurant process (DFCRP) provides an estimate of clustering uncertainty. In this work, we provide guidance on hyperparameter specification, present a Gibbs sampler, and perform a simulation study to compare the performance of the DFCRP to the CRP. Finally, we apply the DFCRP to the crater identification problem of R14, comparing results, and also demonstrate the types of analyses that can be performed with posterior draws of cluster assignments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Dysfunctional Family Chinese Restaurant Process (DFCRP), a modification of the standard Chinese restaurant process (CRP) that incorporates a constraint to model dependencies among crater identifications made by the same expert. The approach is motivated by the variability in expert crater lists documented in Robbins et al. (2014). The paper supplies hyperparameter guidance, a Gibbs sampler for posterior inference, a simulation study comparing DFCRP to the CRP, and an application to the R14 lunar crater data that demonstrates use of posterior cluster assignments for downstream analyses. The central claim is that the DFCRP yields valid uncertainty estimates for combined clusters while addressing shortcomings of prior DBSCAN-based aggregation.
Significance. If the DFCRP construction preserves a valid posterior over partitions and produces calibrated uncertainty that properly reflects inter-expert variability, the method would provide a principled Bayesian nonparametric alternative for aggregating expert annotations in astronomy and related imaging domains. This could improve reliability of crater population statistics used in Solar System chronology studies. The inclusion of a simulation study and real-data demonstration adds practical value, though the strength of these contributions depends on verification that the dysfunctional-family constraint does not distort the exchangeability or concentration properties of the underlying CRP.
major comments (3)
- [§3] §3 (DFCRP definition and seating probabilities): the dysfunctional family constraint is introduced as an indicator or penalty on same-expert labels. It is unclear whether this modification renormalizes the seating probabilities or adjusts the concentration parameter to maintain the exchangeability and partition distribution properties of the CRP; without such adjustment the resulting measure may place positive mass on invalid configurations or produce miscalibrated credible sets for cluster uncertainty.
- [§4] §4 (simulation study): the reported comparisons focus on point-estimate clustering performance (e.g., adjusted Rand index) but do not include diagnostics for posterior calibration such as frequentist coverage of true cluster assignments across repeated simulations that embed known expert-specific dependence structures.
- [§5] §5 (application to R14 data): the comparison with the original DBSCAN results is primarily qualitative. No quantitative evaluation is provided of how the DFCRP posterior uncertainty propagates into downstream quantities such as crater size-frequency distributions or how it differs from bootstrap or other uncertainty estimates.
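For readers unfamiliar with the adjusted Rand index mentioned in the second comment, a minimal pure-Python sketch of the standard pair-counting formula is given here. It scores a single point-estimate partition against the truth, which is why it says nothing about posterior calibration. The function name is hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: partitions agree trivially
    # Pairs that fall together in both partitions, and in each one separately.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # relabeled but identical
cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # maximally crossed
print(same, cross)
```

The index is invariant to label names, equals 1 for identical partitions, and can be negative when agreement is worse than expected by chance.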
minor comments (2)
- [Abstract] The term 'dysfunctional family' is used without a concise intuitive explanation in the abstract or early introduction; a short parenthetical gloss would aid readers outside the immediate subfield.
- [§2] Notation for the expert identifier variable and the constraint function should be introduced once and used consistently; occasional switches between textual description and symbols reduce readability.
Simulated Author's Rebuttal
We thank the referee for their insightful comments. We address each major comment below and have revised the manuscript to clarify the DFCRP's theoretical properties, add calibration diagnostics to the simulation study, and provide quantitative evaluations in the real-data application.
-
Referee: [§3] §3 (DFCRP definition and seating probabilities): the dysfunctional family constraint is introduced as an indicator or penalty on same-expert labels. It is unclear whether this modification renormalizes the seating probabilities or adjusts the concentration parameter to maintain the exchangeability and partition distribution properties of the CRP; without such adjustment the resulting measure may place positive mass on invalid configurations or produce miscalibrated credible sets for cluster uncertainty.
Authors: We appreciate the referee pointing out the need for clarity here. The DFCRP implements the dysfunctional family constraint by assigning zero seating probability to any cluster that already contains a crater marked by the same expert, and then renormalizing the probabilities over the remaining valid options. The process therefore assigns positive probability only to valid partitions and defines a proper distribution over the constrained space. Although conditioning on the expert labels breaks full exchangeability, the marginal distribution over partitions remains well-defined. We have revised §3 to include the explicit formula for the renormalized seating probabilities and a discussion of the resulting partition distribution. We have also added a proof sketch showing that the posterior is valid and that the credible sets are calibrated under the model assumptions. revision: yes
-
Referee: [§4] §4 (simulation study): the reported comparisons focus on point-estimate clustering performance (e.g., adjusted Rand index) but do not include diagnostics for posterior calibration such as frequentist coverage of true cluster assignments across repeated simulations that embed known expert-specific dependence structures.
Authors: We agree that posterior calibration is crucial for validating the uncertainty estimates. In the revised manuscript, we have expanded the simulation study to include frequentist coverage rates for the 95% credible intervals of cluster assignments. Simulations were conducted with known dependence structures mimicking expert variability, and the results indicate that the DFCRP achieves coverage rates close to the nominal 95%, whereas the standard CRP shows undercoverage. We have added a new table and figure presenting these calibration diagnostics. revision: yes
-
Referee: [§5] §5 (application to R14 data): the comparison with the original DBSCAN results is primarily qualitative. No quantitative evaluation is provided of how the DFCRP posterior uncertainty propagates into downstream quantities such as crater size-frequency distributions or how it differs from bootstrap or other uncertainty estimates.
Authors: The referee is correct that the original submission focused on qualitative comparisons. To address this, we have added a quantitative analysis in the revised §5. Specifically, we compute the size-frequency distribution (SFD) power-law slopes from posterior samples of the DFCRP cluster assignments and compare the resulting uncertainty intervals to those from DBSCAN with bootstrap resampling. The DFCRP yields wider uncertainty bands that better reflect the inter-expert variability documented in R14, and we demonstrate this through plots of the SFD with credible intervals. This shows the practical impact on downstream analyses used in Solar System chronology. revision: yes
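The propagation step described in the last response could look roughly like the sketch below: one slope per posterior draw, so that the spread of slopes reflects clustering uncertainty. Several choices here are assumptions for illustration only: each cluster's diameter is taken as the mean of its member marks, the cumulative size-frequency distribution is fitted by ordinary least squares in log-log space, and all function names are hypothetical.

```python
import math

def cluster_diameters(labels, diameters):
    """One diameter per cluster: the mean of its member marks (assumed summary)."""
    groups = {}
    for lab, d in zip(labels, diameters):
        groups.setdefault(lab, []).append(d)
    return [sum(v) / len(v) for v in groups.values()]

def cumulative_sfd_slope(diams):
    """Least-squares slope of log10 N(>=D) against log10 D.

    Needs at least two distinct diameters; tied diameters get consecutive ranks.
    """
    ds = sorted(diams, reverse=True)
    xs = [math.log10(d) for d in ds]
    ys = [math.log10(rank) for rank in range(1, len(ds) + 1)]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def slope_draws(draws, diameters):
    """One slope per posterior draw of cluster assignments."""
    return [cumulative_sfd_slope(cluster_diameters(labels, diameters)) for labels in draws]

# Toy data: five marks with made-up diameters and two made-up posterior draws.
mark_diameters = [1.0, 1.0, 0.5, 0.5, 0.25]
toy_draws = [[0, 0, 1, 1, 2], [0, 1, 2, 2, 3]]
slopes = slope_draws(toy_draws, mark_diameters)
print(slopes)
```

An interval for the slope can then be read from the quantiles of the resulting list, given enough posterior draws.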
Circularity Check
DFCRP construction and posterior sampling are self-contained without reduction to fitted inputs
The paper defines the dysfunctional family constraint as an explicit modification to the CRP seating probabilities that incorporates expert crater-identifier information, then derives a Gibbs sampler directly from that joint distribution. No step equates a claimed uncertainty estimate or cluster assignment to a quantity already fitted from the same data by construction, nor does any load-bearing premise reduce to a self-citation whose content is unverified. The simulation study and R14 application serve as external checks rather than tautological outputs. The derivation chain therefore remains independent of the target results.
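For orientation, a Gibbs update of this kind would typically take the following form. This is a sketch that combines the standard collapsed CRP-mixture conditional with the zero-probability rule quoted on this page; the likelihood term $f$, the size weight $n_k^{-i}$, and the concentration parameter $\alpha$ are assumptions carried over from the standard CRP, not formulas taken from the paper.

```latex
\Pr(c_i = k \mid c_{-i}, x, y) \;\propto\;
\begin{cases}
  n_k^{-i}\, \mathbb{1}\{\, n_{k,x_i}^{-i} = 0 \,\}\, f(y_i \mid y_k^{-i}), & \text{existing cluster } k, \\[4pt]
  \alpha\, f(y_i), & \text{new cluster.}
\end{cases}
```

Here $c_i$ is the cluster of mark $i$, $x_i$ its expert identifier, $y_i$ its observed attributes, $n_k^{-i}$ the number of marks in cluster $k$ excluding mark $i$, and $n_{k,x_i}^{-i}$ the number of those made by expert $x_i$. The indicator removes every cluster that already contains a mark from the same expert, and normalizing over the remaining options gives the sampling distribution for $c_i$.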
Axiom & Free-Parameter Ledger
free parameters (1)
- hyperparameters of the DFCRP
axioms (1)
- domain assumption: The Chinese restaurant process provides an appropriate nonparametric prior for clustering crater identifications from multiple experts.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
The key innovation is incorporating a dysfunctional family constraint into the Bayesian nonparametric clustering approach, the Chinese restaurant process (CRP)... $\Pr^{*}_{\mathrm{DFCRP}}(c_i = k \mid \ldots) = 0$ if $n_{k,x_i} > 0$
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We present a novel clustering approach... DFCRP provides an estimate of clustering uncertainty
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Clustering high dimensional data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2012.
- [2] Aldous, David J. Exchangeability and related topics. École d'été de probabilités de Saint-Flour.
- [3] Altman, Naomi S. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician.
- [4] Ankerst, Mihael and Breunig, Markus M. and Kriegel, Hans-Peter and Sander, J. OPTICS: [...]. ACM SIGMOD International Conference on Management of Data.
- [5] Blei, David M. and Frazier, Peter I. Distance dependent [...]. Journal of Machine Learning Research.
- [6] Blei, David M. and Griffiths, Thomas L. and Jordan, Michael I. The nested [...]. 2010.
- [7] Ester, Martin and Kriegel, Hans-Peter and Sander, J. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
- [8] Fox, Emily B. and Sudderth, Erik B. and Jordan, Michael I. and Willsky, Alan S. A sticky HDP-HMM with application to speaker diarization. The Annals of Applied Statistics.
- [9] Inference from iterative simulation using multiple sequences. Statistical Science.
- [10] Gershman, Samuel J. and Blei, David M. A tutorial on [...]. 2012.
- [11] Ghosh, Soumya and Ungureanu, Andrei B. and Sudderth, Erik B. and Blei, David M. Spatial distance dependent [...].
- [12] Guo, Jiqiang and Wilson, Alyson G. and Nordman, Daniel J. 2013.
- [13] Jaccard, Paul.
- [14] Lloyd, Stuart. Least squares quantization in [...]. IEEE Transactions on Information Theory.
- [15] Dirichlet Process Mixture Models with Pairwise Constraints for Data Clustering. Annals of Data Science, 2016.
- [16] Ng, Andrew Y. and Jordan, Michael I. and Weiss, Yair. On spectral clustering: [...].
- [17] Orbanz, Peter and Buhmann, Joachim M. Nonparametric [...]. 2008.
- [18] Plummer, Martyn and Best, Nicky and Cowles, Kate and Vines, Karen. 2006.
- [19] Stracuzzi, David J. and Brost, Randy C. and Phillips, Cynthia A. and Robinson, David G. and Wilson, Alyson G. and Woodbridge, Diane M.-K. Statistical Analysis and Data Mining: The ASA Data Science Journal. doi:10.1002/sam.11294
- [20]
- [21] Dirichlet process mixture models for verb clustering. Proceedings of the ICML workshop on Prior Knowledge for Text and Language.
- [22] Constrained k-means clustering with background knowledge. ICML.
- [23] On constrained spectral clustering and its applications. Data Mining and Knowledge Discovery, 2014.
- [24] Gelman, Andrew and Carlin, John B. and Stern, Hal S. and Dunson, David B. and Vehtari, Aki and Rubin, Donald B. Bayesian [...].
- [25] Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 1971.
- [26] Geman, Stuart and Geman, Donald. Stochastic relaxation, [...]. 1984.
- [27]
- [28] Methodology for assessing measurement error for casting surface inspection. International Journal of Metalcasting, 2011.
- [29] Wu, Bo and Nevatia, Ramakant. Detection of multiple, partially occluded humans in a single image by [...]. 2005.
- [30] The recognition and tracking of traffic lights based on color segmentation and camshift for intelligent vehicles. Intelligent Vehicles Symposium (IV), 2010 IEEE, 2010.
- [31] Cholleti, Sharath R. and Goldman, Sally A. and Blum, Avrim and Politte, David G. and Don, Steven and Smith, Kirk and Prior, Fred. Veritas: [...]. 2009.
- [32] Polyp detection in wireless capsule endoscopy videos based on image segmentation and geometric feature. Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, 2010.
- [33]
- [34] Kirchoff, Michelle R. and Chapman, Clark R. and Marchi, Simone and Curtis, Kristen M. and Enke, Brian and Bottke, William F. Ages of large lunar impact craters and implications for bombardment during the [...]. 2013.
- [35] Robbins, Stuart J. and Antonenko, Irene and Kirchoff, Michelle R. and Chapman, Clark R. and Fassett, Caleb I. and Herrick, Robert R. and Singer, Kelsi and Zanetti, Michael and Lehan, Cory and Huang, Di and Gay, Pamela. The variability of crater identification among expert and community crater analysts. Icarus.
- [36] Random partition distribution indexed by pairwise information. Journal of the American Statistical Association, 2017.
- [37] Miller, Jeffrey W. and Harrison, Matthew T. Journal of Machine Learning Research.
- [38] Wehrhahn, Claudia and Leonard, Samuel and Rodriguez, Abel and Xifara, Tatiana. Electronic Journal of Statistics.
- [39]
- [40] Bai, Liang and Liang, Jiye and Zhao, Yunxiao. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- [41] Lee, Hyunmin and Kang, Donggoo and Park, Hasil and Park, Sangwoo and Jeong, Dasol and Paik, Joonki. IEEE Access.
- [42] Moraffah, Bahman and Papandreou-Suppappola, Antonia. Sensors.
- [43] Zhao, Tianqi and Wang, Guanyang and Tan, Yan Shuo and Zhang, Qiong.
- [44] Wei, Xiuxi and Zhang, Zhihui and Huang, Huajuan and Zhou, Yongquan. Neurocomputing.