Nested Atoms Model with Application to Clustering Big Population-Scale Single-Cell Data

Arhit Chakrabarti; Bani K. Mallick; Yang Ni; Yuchao Jiang

arxiv: 2604.11731 · v1 · submitted 2026-04-13 · 📊 stat.ME · stat.AP· stat.ML

Nested Atoms Model with Application to Clustering Big Population-Scale Single-Cell Data

Arhit Chakrabarti , Yang Ni , Yuchao Jiang , Bani K. Mallick This is my paper

Pith reviewed 2026-05-10 15:24 UTC · model grok-4.3

classification 📊 stat.ME stat.APstat.ML

keywords nested clusteringBayesian nonparametricsingle-cell RNA-seqgenotype datavariational inferenceimmune cell typeshierarchical datapopulation-scale data

0 comments

The pith

The Nested Atoms Model jointly clusters genetically similar individuals and their cells by gene expression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses clustering of nested data where observations are grouped, with variables at both the group level like individual genotypes and the observation level like cell gene expressions. Existing methods for grouped clustering omit group-level variables, limiting their ability to capture how genetic differences among people shape their cell-type profiles in large single-cell datasets. NAM is introduced as a Bayesian nonparametric model that performs simultaneous two-layered clustering of individuals and cells while using both data types. A variational Bayesian algorithm scales the model to high-dimensional data with over a million cells. When applied to the OneK1K dataset, the resulting clusters of genetically similar individuals show homogeneous cell profiles that match known immune cell types through differential gene expression analysis.

Core claim

The Nested Atoms Model is a Bayesian nonparametric approach for two-layered clustering of nested data that jointly models group-level variables such as individual-specific genotypes and observation-level variables such as cell-specific gene expressions. It defines cluster assignments at both the individual level and the cell level within individuals to capture heterogeneity across scales. A fast variational Bayesian inference procedure is developed to handle large-scale high-dimensional data. Simulations show improved performance over methods that ignore group-level information, and application to the OneK1K single-cell dataset with 982 individuals and 1.27 million cells produces cell-type–s

What carries the argument

The Nested Atoms Model, a Bayesian nonparametric model that performs simultaneous clustering of groups using group-level variables and of observations within groups using observation-level variables.

If this is right

NAM outperforms existing grouped clustering methods that ignore group-level variables in simulation studies.
The variational inference algorithm allows scaling to population-scale datasets with over a million cells.
Application to the OneK1K dataset yields clusters of genetically similar individuals whose cell profiles are homogeneous and align with known immune cell types.
The model enables investigation of how genetic variations among individuals influence differences in cell-type profiles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The two-level structure could be adapted to other nested biological datasets, such as patients with multiple tissue samples.
Successful alignment with known cell types suggests the model might help discover new genetically influenced cell subtypes.
Adding covariates at either the group or observation level would extend the framework to more complex multi-omics settings.

Load-bearing premise

That jointly modeling group-level genotype variables together with observation-level gene expression will produce clusters that meaningfully capture heterogeneity at both levels and align with external biological knowledge.

What would settle it

If the cell clusters identified by NAM on the OneK1K data do not show differential gene expression patterns consistent with known immune cell types, or if simulations show no performance gain over methods that ignore group-level variables.

Figures

Figures reproduced from arXiv: 2604.11731 by Arhit Chakrabarti, Bani K. Mallick, Yang Ni, Yuchao Jiang.

**Figure 2.** Figure 2: Comparison of group-clustering accuracy of NAM with CAM and fiSAN against varying dimensions [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗

**Figure 3.** Figure 3: Top row: distributions of the (a) median and (b) maximum computing time with varying dimen [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗

**Figure 4.** Figure 4: Observational clustering for selected individuals by estimated group clusters (GCs). The colors [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗

**Figure 5.** Figure 5: Observational clustering for the same individuals by estimated group clusters (GCs). The colors [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗

**Figure 6.** Figure 6: Boxplot of gene expressions in the top six differentially expressed genes for selected individuals by [PITH_FULL_IMAGE:figures/full_fig_p028_6.png] view at source ↗

**Figure 7.** Figure 7: Heatmaps of the top 20 differentially expressed genes obtained from pseudobulk differential ex [PITH_FULL_IMAGE:figures/full_fig_p029_7.png] view at source ↗

**Figure 8.** Figure 8: Violin plots of the log counts per million (CPM) for the gene MS4A1, comparing expression levels [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗

**Figure 9.** Figure 9: Barplot summarizing the Gene Ontology enrichment analysis for the genes differentially expressed [PITH_FULL_IMAGE:figures/full_fig_p031_9.png] view at source ↗

read the original abstract

We consider the problem of clustering nested or hierarchical data, where observations are grouped and there are both group-level and observation-level variables. In our motivating OneK1K dataset, observations consist of single-cell RNA-sequencing (scRNA-seq) data from 982 individuals (groups), totaling 1.27 million cells (observations), along with individual-specific genotype data. This type of data would enable the identification of cell types and the investigation of how genetic variations among individuals influence differences in cell-type profiles. Our goal, therefore, is to jointly cluster cells and individuals to capture the heterogeneity across both levels using cell-specific gene expressions as well as individual-specific genotypes. However, existing grouped clustering methods do not incorporate group-level variables, thereby limiting their ability to capture the heterogeneity of genotypes in our motivating application. To address this, we propose the Nested Atoms Model (NAM), a new Bayesian nonparametric approach that enables the desired two-layered clustering, accounting for both group-level and observation-level variables. To scale NAM for high-dimensional data, we develop a fast variational Bayesian inference algorithm. Simulations show that NAM outperforms existing methods that ignore group-level variables. Applied to the OneK1K dataset, NAM identifies clusters of genetically similar individuals with homogeneous cell-type profiles. The resulting cell clusters align with known immune cell types based on differential gene expression, underscoring the ability of NAM to capture nested heterogeneity and provide biologically meaningful insights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NAM gives a workable Bayesian nonparametric way to jointly cluster individuals by genotype and cells by expression at scale, but the real-data results do not yet show the genotype layer is necessary for the claimed biological alignment.

read the letter

The main takeaway is that this paper builds a new model, the Nested Atoms Model, that does two-level clustering while bringing in group-level covariates like genotypes alongside cell-level gene expression. It targets exactly the kind of hierarchical single-cell data now common in large cohorts, and the authors scale it with variational inference to handle over a million cells from nearly a thousand individuals. Simulations indicate it beats methods that drop the group-level information, which is a reasonable check on the modeling choice. On the OneK1K data the cell clusters line up with known immune types through differential expression, which is at least directionally encouraging for the application area. That is the concrete advance: a BNP construction that explicitly folds group covariates into the nested partition rather than treating groups as exchangeable only. The scaling work is also practical and necessary for this data size. The soft spot is the real-data section. The headline result—that NAM finds genetically similar individuals with homogeneous cell profiles whose cell clusters match biology—rests on post-hoc differential expression without an ablation that removes the genotype layer or quantitative agreement metrics against expert cell-type labels. It is therefore unclear whether the joint model changes the cell partitions in a way that matters or whether expression data alone would produce similar DE results. The stress-test note captures this accurately; the abstract supplies no permutation test, no adjusted Rand index, and no direct comparison isolating the genotype contribution. This is not a fatal gap, but it is the part that needs tightening before the biological claim can be taken as strong evidence for the modeling innovation. The paper is aimed at statistical geneticists and computational biologists who work with multi-level single-cell data and want a nonparametric joint clustering tool. A reader already comfortable with Dirichlet process mixtures and variational methods will get the most out of the model construction and the scaling details. It is coherent enough and addresses a real methodological gap, so it deserves a serious referee even if revisions on validation are required. I would send it out for review with a request for clearer quantitative checks on whether the nested genotype component is load-bearing for the cell-type findings.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Nested Atoms Model (NAM), a Bayesian nonparametric approach for joint two-layer clustering of nested data with group-level variables (e.g., individual genotypes) and observation-level variables (e.g., single-cell gene expressions). It develops a scalable variational Bayesian inference algorithm and applies the model to the OneK1K dataset (982 individuals, 1.27M cells). Simulations are reported to show outperformance over methods that ignore group-level variables, while the real-data analysis claims that NAM identifies clusters of genetically similar individuals with homogeneous cell-type profiles whose cell clusters align with known immune cell types via differential gene expression.

Significance. If the central claims hold, NAM would fill a methodological gap in grouped clustering by incorporating group-level covariates into a nonparametric framework, enabling joint inference on genetic and cellular heterogeneity at population scale. The variational inference procedure is a practical strength for big data. However, the significance is limited by the absence of quantitative evidence that the genotype layer contributes meaningfully to the reported biological alignments beyond what expression-only clustering would achieve.

major comments (2)

[Application to OneK1K dataset] Application to OneK1K dataset (real-data results section): the claim that cell clusters 'align with known immune cell types based on differential gene expression' is supported only by post-hoc DE analysis; no quantitative metrics (adjusted Rand index against expert annotations, permutation test of marker strength, or ablation removing the genotype layer) are provided to demonstrate that the nested structure is load-bearing rather than the scRNA-seq marginal alone.
[Simulation studies] Simulation studies section: while outperformance is asserted, the manuscript does not report the specific generative process used to simulate nested group- and observation-level structure or the exact metrics (e.g., ARI, clustering accuracy) and baseline implementations, making it impossible to verify that the comparison fairly isolates the benefit of incorporating group-level variables.

minor comments (2)

[Abstract and Introduction] The abstract and introduction could more explicitly define the variational approximation (e.g., mean-field assumptions or ELBO terms) rather than deferring all details to the methods section.
[Model specification] Notation for the nested atoms and atom assignments is introduced without a clear summary table or diagram, which would aid readability for readers unfamiliar with nested nonparametric models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment point by point below, with revisions proposed where they improve clarity and evidence without altering the manuscript's core claims.

read point-by-point responses

Referee: Application to OneK1K dataset (real-data results section): the claim that cell clusters 'align with known immune cell types based on differential gene expression' is supported only by post-hoc DE analysis; no quantitative metrics (adjusted Rand index against expert annotations, permutation test of marker strength, or ablation removing the genotype layer) are provided to demonstrate that the nested structure is load-bearing rather than the scRNA-seq marginal alone.

Authors: We agree that additional quantitative support would strengthen the demonstration that the genotype layer contributes to the observed alignments. In the revised manuscript we will add an ablation experiment fitting an expression-only variant of NAM and comparing its cell clusters to the full model via adjusted Rand index. We will also report a permutation-based test of marker-gene enrichment strength and enrichment scores against known immune cell-type markers. Full cell-level expert annotations are unavailable for the full 1.27 M cells, so direct ARI against ground truth is not feasible; the between-model ARI and permutation tests nevertheless provide quantitative evidence that the nested structure is load-bearing. revision: yes
Referee: Simulation studies section: while outperformance is asserted, the manuscript does not report the specific generative process used to simulate nested group- and observation-level structure or the exact metrics (e.g., ARI, clustering accuracy) and baseline implementations, making it impossible to verify that the comparison fairly isolates the benefit of incorporating group-level variables.

Authors: We thank the referee for highlighting the need for greater reproducibility. The simulation section already specifies a nested generative process that draws group-level genotype clusters from a Dirichlet process and then observation-level expression clusters conditional on group membership, with controlled noise levels. To make this fully verifiable we will expand the section to state the exact generative parameters, report ARI and normalized mutual information for both the group and observation layers, and list the precise baseline implementations (including package versions and hyper-parameter settings). These additions will allow readers to confirm that the reported gains isolate the benefit of the group-level variables. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in NAM derivation

full rationale

The paper introduces NAM as a novel Bayesian nonparametric model for joint clustering of nested data with group-level genotypes and observation-level expressions, develops a new variational inference algorithm to scale it, and validates via simulations (outperforming baselines that ignore group variables) plus application to OneK1K data where cell clusters align with known immune types via differential expression. No derivation step reduces by construction to fitted inputs, self-definitions, or load-bearing self-citations; the central claims rest on the new model structure and external simulation/biological checks rather than tautological renaming or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate specific free parameters, axioms, or invented entities; the model is described only at the level of a new Bayesian nonparametric approach with variational inference.

pith-pipeline@v0.9.0 · 5564 in / 1017 out tokens · 24216 ms · 2026-05-10T15:24:28.227849+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1]

, J,q ⋆(Sj) is aK-dimensional multinomial, withq ⋆(Sj =k) =ρ jk fork= 1,

Forj= 1, . . . , J,q ⋆(Sj) is aK-dimensional multinomial, withq ⋆(Sj =k) =ρ jk fork= 1, . . . , K, logρ jk =g(¯ak,¯bk) + k−1X r=1 g(¯br,¯ar) + LX l=1 " njX i=1 ξjil !( g(¯alk,¯blk) + l−1X t=1 g(¯btk,¯atk) )# + + 1 2 ℓ(x,1) k +ℓ (x,2) jk , where g(x, y) =ψ(x)−ψ(x+y),withψdenoting the digamma function, ℓ(x,1) k = qX i=1 ψ((c x k −i+ 1)/2) +qlog 2 + log|D x ...

work page
[2]

, Jandi= 1,

Forj= 1, . . . , Jandi= 1, . . . , n j,q ⋆(Mji) is aL-dimensional multinomial, withq ⋆(Mji =l) =ξ jil forl= 1, . . . , L, logξ jil = 1 2 ℓ(y,1) l +ℓ (y,2) jil + KX k=1 ρjk ( g(¯alk,¯blk) + l−1X t=1 g(¯btk,¯atk) ) , where ℓ(y,1) l = pX i=1 ψ (cy l −i+ 1)/2 +plog 2 + log|D y l |and ℓ(y,2) jil =−p/t y l −c y l (yji −m y l )T Dy l (yji −m y l )

work page
[3]

, Kandl= 1,

Fork= 1, . . . , Kandl= 1, . . . , L−1,q ⋆(ulk) is aBeta(¯alk,¯blk) distribution with ¯alk = 1 + JX j=1 njX i=1 ξjit , ¯blk =r 1/r2 + JX j=1   ρjk   njX i=1 LX t=l+1 ξjit      . 41

work page
[4]

, K−1,q ⋆(vk) is aBeta(¯ak,¯bk) distribution with ¯ak = 1 + JX j=1 ρjk , ¯bk =s 1/s2 + JX j=1 KX l=k+1 ρjl

Fork= 1, . . . , K−1,q ⋆(vk) is aBeta(¯ak,¯bk) distribution with ¯ak = 1 + JX j=1 ρjk , ¯bk =s 1/s2 + JX j=1 KX l=k+1 ρjl

work page
[5]

Fork= 1, . . . , K,q ⋆(µx k,Λ x k) is a NW(m x k, tx k, cx k,D x k) distribution with parameters mx k =t x−1 k (λx 0 µx 0 +N x k ¯xk), t x k =λ x 0 +N x k , c x k =ν x 0 +N x k , Dx−1 k =Ψ x−1 0 + λx 0 N x k λx 0 +N x k n (¯xk −µ x

work page
[6]

(¯xk −µ x 0)T o +S x k, where N x k = JX j=1 ρjk ,¯x k =N x−1 k   JX j=1 ρjk xj   , Sx k = JX j=1 ρjk n (xj −¯xk) (xj −¯xk)T o

work page
[7]

logp LY l=1 {µy l ,Λ y l } !# =LlogB(Ψ y 0, νy 0 ) + 1 2 n (νy 0 −p−1) LX l=1 ℓ(y,1) l o − 1 2 n LX l=1 cy l T(Ψ y−1 0 Dy l ) o + 1 2

Forl= 1, . . . , L,q ⋆(µy l ,Λ y l ) is a NW(m y l , ty l , cy l ,D y l ) distribution with parameters my l =t y−1 l (λy 0 µy 0 +N y l ¯yl), t y l =λ y 0 +N y l , c y l =ν y 0 +N y l , Dy−1 l =Ψ y−1 0 + λy 0 N y l λy 0 +N y l n ¯yl −µ y 0 ¯yl −µ y 0 To +S y l , where N y l = JX j=1 njX i=1 ξjil ,¯y l =N y−1 l JX j=1 njX i=1 ξjil yji , Sy l = JX j=1 njX i=...

work page 2006

[1] [1]

, J,q ⋆(Sj) is aK-dimensional multinomial, withq ⋆(Sj =k) =ρ jk fork= 1,

Forj= 1, . . . , J,q ⋆(Sj) is aK-dimensional multinomial, withq ⋆(Sj =k) =ρ jk fork= 1, . . . , K, logρ jk =g(¯ak,¯bk) + k−1X r=1 g(¯br,¯ar) + LX l=1 " njX i=1 ξjil !( g(¯alk,¯blk) + l−1X t=1 g(¯btk,¯atk) )# + + 1 2 ℓ(x,1) k +ℓ (x,2) jk , where g(x, y) =ψ(x)−ψ(x+y),withψdenoting the digamma function, ℓ(x,1) k = qX i=1 ψ((c x k −i+ 1)/2) +qlog 2 + log|D x ...

work page

[2] [2]

, Jandi= 1,

Forj= 1, . . . , Jandi= 1, . . . , n j,q ⋆(Mji) is aL-dimensional multinomial, withq ⋆(Mji =l) =ξ jil forl= 1, . . . , L, logξ jil = 1 2 ℓ(y,1) l +ℓ (y,2) jil + KX k=1 ρjk ( g(¯alk,¯blk) + l−1X t=1 g(¯btk,¯atk) ) , where ℓ(y,1) l = pX i=1 ψ (cy l −i+ 1)/2 +plog 2 + log|D y l |and ℓ(y,2) jil =−p/t y l −c y l (yji −m y l )T Dy l (yji −m y l )

work page

[3] [3]

, Kandl= 1,

Fork= 1, . . . , Kandl= 1, . . . , L−1,q ⋆(ulk) is aBeta(¯alk,¯blk) distribution with ¯alk = 1 + JX j=1 njX i=1 ξjit , ¯blk =r 1/r2 + JX j=1   ρjk   njX i=1 LX t=l+1 ξjit      . 41

work page

[4] [4]

, K−1,q ⋆(vk) is aBeta(¯ak,¯bk) distribution with ¯ak = 1 + JX j=1 ρjk , ¯bk =s 1/s2 + JX j=1 KX l=k+1 ρjl

Fork= 1, . . . , K−1,q ⋆(vk) is aBeta(¯ak,¯bk) distribution with ¯ak = 1 + JX j=1 ρjk , ¯bk =s 1/s2 + JX j=1 KX l=k+1 ρjl

work page

[5] [5]

Fork= 1, . . . , K,q ⋆(µx k,Λ x k) is a NW(m x k, tx k, cx k,D x k) distribution with parameters mx k =t x−1 k (λx 0 µx 0 +N x k ¯xk), t x k =λ x 0 +N x k , c x k =ν x 0 +N x k , Dx−1 k =Ψ x−1 0 + λx 0 N x k λx 0 +N x k n (¯xk −µ x

work page

[6] [6]

(¯xk −µ x 0)T o +S x k, where N x k = JX j=1 ρjk ,¯x k =N x−1 k   JX j=1 ρjk xj   , Sx k = JX j=1 ρjk n (xj −¯xk) (xj −¯xk)T o

work page

[7] [7]

logp LY l=1 {µy l ,Λ y l } !# =LlogB(Ψ y 0, νy 0 ) + 1 2 n (νy 0 −p−1) LX l=1 ℓ(y,1) l o − 1 2 n LX l=1 cy l T(Ψ y−1 0 Dy l ) o + 1 2

Forl= 1, . . . , L,q ⋆(µy l ,Λ y l ) is a NW(m y l , ty l , cy l ,D y l ) distribution with parameters my l =t y−1 l (λy 0 µy 0 +N y l ¯yl), t y l =λ y 0 +N y l , c y l =ν y 0 +N y l , Dy−1 l =Ψ y−1 0 + λy 0 N y l λy 0 +N y l n ¯yl −µ y 0 ¯yl −µ y 0 To +S y l , where N y l = JX j=1 njX i=1 ξjil ,¯y l =N y−1 l JX j=1 njX i=1 ξjil yji , Sy l = JX j=1 njX i=...

work page 2006