Nested Atoms Model with Application to Clustering Big Population-Scale Single-Cell Data
Pith reviewed 2026-05-10 15:24 UTC · model grok-4.3
The pith
The Nested Atoms Model jointly clusters genetically similar individuals and their cells by gene expression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Nested Atoms Model is a Bayesian nonparametric approach for two-layered clustering of nested data that jointly models group-level variables such as individual-specific genotypes and observation-level variables such as cell-specific gene expressions. It defines cluster assignments at both the individual level and the cell level within individuals to capture heterogeneity across scales. A fast variational Bayesian inference procedure is developed to handle large-scale high-dimensional data. Simulations show improved performance over methods that ignore group-level information, and application to the OneK1K single-cell dataset with 982 individuals and 1.27 million cells produces cell-type–s
What carries the argument
The Nested Atoms Model, a Bayesian nonparametric model that performs simultaneous clustering of groups using group-level variables and of observations within groups using observation-level variables.
If this is right
- NAM outperforms existing grouped clustering methods that ignore group-level variables in simulation studies.
- The variational inference algorithm allows scaling to population-scale datasets with over a million cells.
- Application to the OneK1K dataset yields clusters of genetically similar individuals whose cell profiles are homogeneous and align with known immune cell types.
- The model enables investigation of how genetic variations among individuals influence differences in cell-type profiles.
Where Pith is reading between the lines
- The two-level structure could be adapted to other nested biological datasets, such as patients with multiple tissue samples.
- Successful alignment with known cell types suggests the model might help discover new genetically influenced cell subtypes.
- Adding covariates at either the group or observation level would extend the framework to more complex multi-omics settings.
Load-bearing premise
That jointly modeling group-level genotype variables together with observation-level gene expression will produce clusters that meaningfully capture heterogeneity at both levels and align with external biological knowledge.
What would settle it
If the cell clusters identified by NAM on the OneK1K data do not show differential gene expression patterns consistent with known immune cell types, or if simulations show no performance gain over methods that ignore group-level variables.
Original abstract
We consider the problem of clustering nested or hierarchical data, where observations are grouped and there are both group-level and observation-level variables. In our motivating OneK1K dataset, observations consist of single-cell RNA-sequencing (scRNA-seq) data from 982 individuals (groups), totaling 1.27 million cells (observations), along with individual-specific genotype data. This type of data would enable the identification of cell types and the investigation of how genetic variations among individuals influence differences in cell-type profiles. Our goal, therefore, is to jointly cluster cells and individuals to capture the heterogeneity across both levels using cell-specific gene expressions as well as individual-specific genotypes. However, existing grouped clustering methods do not incorporate group-level variables, thereby limiting their ability to capture the heterogeneity of genotypes in our motivating application. To address this, we propose the Nested Atoms Model (NAM), a new Bayesian nonparametric approach that enables the desired two-layered clustering, accounting for both group-level and observation-level variables. To scale NAM for high-dimensional data, we develop a fast variational Bayesian inference algorithm. Simulations show that NAM outperforms existing methods that ignore group-level variables. Applied to the OneK1K dataset, NAM identifies clusters of genetically similar individuals with homogeneous cell-type profiles. The resulting cell clusters align with known immune cell types based on differential gene expression, underscoring the ability of NAM to capture nested heterogeneity and provide biologically meaningful insights.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Nested Atoms Model (NAM), a Bayesian nonparametric approach for joint two-layer clustering of nested data with group-level variables (e.g., individual genotypes) and observation-level variables (e.g., single-cell gene expressions). It develops a scalable variational Bayesian inference algorithm and applies the model to the OneK1K dataset (982 individuals, 1.27M cells). Simulations are reported to show outperformance over methods that ignore group-level variables, while the real-data analysis claims that NAM identifies clusters of genetically similar individuals with homogeneous cell-type profiles whose cell clusters align with known immune cell types via differential gene expression.
Significance. If the central claims hold, NAM would fill a methodological gap in grouped clustering by incorporating group-level covariates into a nonparametric framework, enabling joint inference on genetic and cellular heterogeneity at population scale. The variational inference procedure is a practical strength for big data. However, the significance is limited by the absence of quantitative evidence that the genotype layer contributes meaningfully to the reported biological alignments beyond what expression-only clustering would achieve.
Major comments (2)
- [Application to OneK1K dataset] Application to OneK1K dataset (real-data results section): the claim that cell clusters 'align with known immune cell types based on differential gene expression' is supported only by post-hoc DE analysis; no quantitative metrics (adjusted Rand index against expert annotations, permutation test of marker strength, or ablation removing the genotype layer) are provided to demonstrate that the nested structure is load-bearing rather than the scRNA-seq marginal alone.
- [Simulation studies] Simulation studies section: while outperformance is asserted, the manuscript does not report the specific generative process used to simulate nested group- and observation-level structure or the exact metrics (e.g., ARI, clustering accuracy) and baseline implementations, making it impossible to verify that the comparison fairly isolates the benefit of incorporating group-level variables.
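The adjusted Rand index (ARI) named in both major comments can be computed directly from two label vectors. The following is a minimal sketch using only the Python standard library; the function name and the toy label lists are illustrative assumptions, not code or data from the manuscript.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two flat clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have equal length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the clusterings trivially agree

    # Contingency counts: items sharing cluster a in A and cluster b in B.
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings are a single cluster
    return (sum_joint - expected) / (max_index - expected)


# Hypothetical labels: one clustering and a relabelled copy of it.
same_up_to_relabelling = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

ARI is invariant to how clusters are numbered, which is why it suits comparisons between a full model and an ablated one whose cluster labels are not aligned.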
Minor comments (2)
- [Abstract and Introduction] The abstract and introduction could more explicitly define the variational approximation (e.g., mean-field assumptions or ELBO terms) rather than deferring all details to the methods section.
- [Model specification] Notation for the nested atoms and atom assignments is introduced without a clear summary table or diagram, which would aid readability for readers unfamiliar with nested nonparametric models.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment point by point below, with revisions proposed where they improve clarity and evidence without altering the manuscript's core claims.
- Referee: Application to OneK1K dataset (real-data results section): the claim that cell clusters 'align with known immune cell types based on differential gene expression' is supported only by post-hoc DE analysis; no quantitative metrics (adjusted Rand index against expert annotations, permutation test of marker strength, or ablation removing the genotype layer) are provided to demonstrate that the nested structure is load-bearing rather than the scRNA-seq marginal alone.
  Authors: We agree that additional quantitative support would strengthen the demonstration that the genotype layer contributes to the observed alignments. In the revised manuscript we will add an ablation experiment that fits an expression-only variant of NAM and compares its cell clusters to those of the full model via the adjusted Rand index (ARI). We will also report a permutation-based test of marker-gene enrichment strength, along with enrichment scores against known immune cell-type markers. Cell-level expert annotations are unavailable for the full 1.27M cells, so direct ARI against ground truth is not feasible; the between-model ARI and permutation tests nevertheless provide quantitative evidence that the nested structure is load-bearing.
  revision: yes
- Referee: Simulation studies section: while outperformance is asserted, the manuscript does not report the specific generative process used to simulate nested group- and observation-level structure or the exact metrics (e.g., ARI, clustering accuracy) and baseline implementations, making it impossible to verify that the comparison fairly isolates the benefit of incorporating group-level variables.
  Authors: We thank the referee for highlighting the need for greater reproducibility. The simulation section already specifies a nested generative process that draws group-level genotype clusters from a Dirichlet process and then observation-level expression clusters conditional on group membership, with controlled noise levels. To make this fully verifiable we will expand the section to state the exact generative parameters, report ARI and normalized mutual information for both the group and observation layers, and list the precise baseline implementations (including package versions and hyper-parameter settings). These additions will allow readers to confirm that the reported gains isolate the benefit of the group-level variables.
  revision: yes
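The nested generative process described in the rebuttal can be sketched as follows. This is an illustration only, under loudly stated assumptions: truncated stick-breaking stands in for the Dirichlet process, data are scalar Gaussians rather than high-dimensional genotypes and expression vectors, cell-level atoms are shared across group clusters, and every name and default (`simulate_nested`, `alpha`, `beta`, `noise`, the truncation levels `K` and `L`) is hypothetical rather than taken from the manuscript.

```python
import random


def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights; the last stick takes the remainder."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, concentration)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


def simulate_nested(n_groups=20, cells_per_group=50, K=3, L=4,
                    alpha=1.0, beta=1.0, noise=0.5, seed=0):
    """Simulate groups with a group-level variable x and cell-level values y."""
    rng = random.Random(seed)
    group_weights = stick_breaking(alpha, K, rng)
    # One weight vector over cell clusters per group cluster (assumption:
    # the cell-level atoms themselves are shared by all group clusters).
    cell_weights = [stick_breaking(beta, L, rng) for _ in range(K)]
    group_atoms = [rng.gauss(0.0, 3.0) for _ in range(K)]  # toy "genotype" means
    cell_atoms = [rng.gauss(0.0, 3.0) for _ in range(L)]   # toy "expression" means

    data = []
    for _ in range(n_groups):
        s = rng.choices(range(K), weights=group_weights)[0]  # group cluster
        x = rng.gauss(group_atoms[s], noise)                 # group-level variable
        cells = []
        for _ in range(cells_per_group):
            m = rng.choices(range(L), weights=cell_weights[s])[0]  # cell cluster
            cells.append((m, rng.gauss(cell_atoms[m], noise)))
        data.append({"S": s, "x": x, "cells": cells})
    return data


simulated = simulate_nested()
```

Because the true labels `S` and the per-cell cluster indices are returned alongside the data, a simulation of this shape allows clustering metrics to be reported separately for the group and observation layers.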
Circularity Check
No significant circularity detected in NAM derivation
The paper introduces NAM as a novel Bayesian nonparametric model for joint clustering of nested data with group-level genotypes and observation-level expressions, develops a new variational inference algorithm to scale it, and validates via simulations (outperforming baselines that ignore group variables) plus application to OneK1K data where cell clusters align with known immune types via differential expression. No derivation step reduces by construction to fitted inputs, self-definitions, or load-bearing self-citations; the central claims rest on the new model structure and external simulation/biological checks rather than tautological renaming or imported uniqueness theorems.
Reference graph
Works this paper leans on
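[1] For $j = 1, \dots, J$, $q^\star(S_j)$ is a $K$-dimensional multinomial with $q^\star(S_j = k) = \rho_{jk}$ for $k = 1, \dots, K$, where
$$\log \rho_{jk} = g(\bar a_k, \bar b_k) + \sum_{r=1}^{k-1} g(\bar b_r, \bar a_r) + \sum_{l=1}^{L} \left[ \left( \sum_{i=1}^{n_j} \xi_{jil} \right) \left\{ g(\bar a_{lk}, \bar b_{lk}) + \sum_{t=1}^{l-1} g(\bar b_{tk}, \bar a_{tk}) \right\} \right] + \frac{1}{2} \left\{ \ell^{(x,1)}_k + \ell^{(x,2)}_{jk} \right\},$$
with $g(x, y) = \psi(x) - \psi(x + y)$, $\psi$ denoting the digamma function, and $\ell^{(x,1)}_k = \sum_{i=1}^{q} \psi\big((c^x_k - i + 1)/2\big) + q \log 2 + \log |D^x$ ...

[2] For $j = 1, \dots, J$ and $i = 1, \dots, n_j$, $q^\star(M_{ji})$ is an $L$-dimensional multinomial with $q^\star(M_{ji} = l) = \xi_{jil}$ for $l = 1, \dots, L$, where
$$\log \xi_{jil} = \frac{1}{2} \left\{ \ell^{(y,1)}_l + \ell^{(y,2)}_{jil} \right\} + \sum_{k=1}^{K} \rho_{jk} \left\{ g(\bar a_{lk}, \bar b_{lk}) + \sum_{t=1}^{l-1} g(\bar b_{tk}, \bar a_{tk}) \right\},$$
with
$$\ell^{(y,1)}_l = \sum_{i=1}^{p} \psi\big((c^y_l - i + 1)/2\big) + p \log 2 + \log |D^y_l|, \qquad \ell^{(y,2)}_{jil} = -p / t^y_l - c^y_l\, (y_{ji} - m^y_l)^T D^y_l\, (y_{ji} - m^y_l).$$

[3] For $k = 1, \dots, K$ and $l = 1, \dots, L - 1$, $q^\star(u_{lk})$ is a $\mathrm{Beta}(\bar a_{lk}, \bar b_{lk})$ distribution with
$$\bar a_{lk} = 1 + \sum_{j=1}^{J} \sum_{i=1}^{n_j} \xi_{jit}, \qquad \bar b_{lk} = r_1 / r_2 + \sum_{j=1}^{J} \rho_{jk} \sum_{i=1}^{n_j} \sum_{t=l+1}^{L} \xi_{jit}.$$

[4] For $k = 1, \dots, K - 1$, $q^\star(v_k)$ is a $\mathrm{Beta}(\bar a_k, \bar b_k)$ distribution with
$$\bar a_k = 1 + \sum_{j=1}^{J} \rho_{jk}, \qquad \bar b_k = s_1 / s_2 + \sum_{j=1}^{J} \sum_{l=k+1}^{K} \rho_{jl}.$$

[5] For $k = 1, \dots, K$, $q^\star(\mu^x_k, \Lambda^x_k)$ is a $\mathrm{NW}(m^x_k, t^x_k, c^x_k, D^x_k)$ distribution with parameters
$$m^x_k = (t^x_k)^{-1} (\lambda^x_0 \mu^x_0 + N^x_k \bar x_k), \qquad t^x_k = \lambda^x_0 + N^x_k, \qquad c^x_k = \nu^x_0 + N^x_k,$$
$$(D^x_k)^{-1} = (\Psi^x_0)^{-1} + \frac{\lambda^x_0 N^x_k}{\lambda^x_0 + N^x_k} \left\{ (\bar x_k - \mu^x_0)(\bar x_k - \mu^x_0)^T \right\} + S^x_k,$$
where
$$N^x_k = \sum_{j=1}^{J} \rho_{jk}, \qquad \bar x_k = (N^x_k)^{-1} \sum_{j=1}^{J} \rho_{jk}\, x_j, \qquad S^x_k = \sum_{j=1}^{J} \rho_{jk} \left\{ (x_j - \bar x_k)(x_j - \bar x_k)^T \right\}.$$

[6] For $l = 1, \dots, L$, $q^\star(\mu^y_l, \Lambda^y_l)$ is a $\mathrm{NW}(m^y_l, t^y_l, c^y_l, D^y_l)$ distribution with parameters
$$m^y_l = (t^y_l)^{-1} (\lambda^y_0 \mu^y_0 + N^y_l \bar y_l), \qquad t^y_l = \lambda^y_0 + N^y_l, \qquad c^y_l = \nu^y_0 + N^y_l,$$
$$(D^y_l)^{-1} = (\Psi^y_0)^{-1} + \frac{\lambda^y_0 N^y_l}{\lambda^y_0 + N^y_l} \left\{ (\bar y_l - \mu^y_0)(\bar y_l - \mu^y_0)^T \right\} + S^y_l,$$
where $N^y_l = \sum_{j=1}^{J} \sum_{i=1}^{n_j} \xi_{jil}$, $\bar y_l = (N^y_l)^{-1} \sum_{j=1}^{J} \sum_{i=1}^{n_j} \xi_{jil}\, y_{ji}$, and $S^y_l = \sum_{j=1}^{J} \sum_{i=}$ ...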
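The function g(x, y) = ψ(x) − ψ(x + y) that recurs in the updates above is the expected log of a Beta(x, y) variable, so sums of g terms give expected log stick-breaking weights. The sketch below illustrates that quantity and the group-level Beta update in plain Python. It is an illustration, not the paper's implementation: the digamma routine is a standard recurrence-plus-asymptotic-series approximation, `prior_b` stands in for the ratio s1/s2, and treating the final weight as the remainder of the stick is a truncation assumption.

```python
import math


def digamma(x):
    """Digamma for x > 0: shift x upward by recurrence, then asymptotic series."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def g(x, y):
    """E[log v] for v ~ Beta(x, y); g(y, x) is then E[log(1 - v)]."""
    return digamma(x) - digamma(x + y)


def expected_log_weights(a_bar, b_bar):
    """E[log pi_k] = g(a_k, b_k) + sum_{r<k} g(b_r, a_r) for each stick.

    Assumption: a final weight is appended that takes the remaining stick.
    """
    out, log_remaining = [], 0.0
    for a, b in zip(a_bar, b_bar):
        out.append(g(a, b) + log_remaining)
        log_remaining += g(b, a)
    out.append(log_remaining)
    return out


def update_group_sticks(rho, prior_b):
    """Beta parameters for the K - 1 group-level sticks.

    rho[j][k] is the responsibility of group cluster k for group j;
    prior_b is a placeholder for s1/s2.
    """
    K = len(rho[0])
    mass = [sum(row[k] for row in rho) for k in range(K)]
    a_bar = [1.0 + mass[k] for k in range(K - 1)]
    b_bar = [prior_b + sum(mass[k + 1:]) for k in range(K - 1)]
    return a_bar, b_bar


# Toy responsibilities: three groups, each assigned to a different cluster.
a_bar, b_bar = update_group_sticks([[1, 0, 0], [0, 1, 0], [0, 0, 1]], prior_b=1.0)
log_weights = expected_log_weights(a_bar, b_bar)
```

Accumulating the g(b_r, a_r) terms in a running sum keeps the computation linear in the number of sticks, which matters when the same pattern is applied to the cell-level sticks for every group cluster.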