pith. sign in

arxiv: 2604.11731 · v1 · submitted 2026-04-13 · 📊 stat.ME · stat.AP· stat.ML

Nested Atoms Model with Application to Clustering Big Population-Scale Single-Cell Data

Pith reviewed 2026-05-10 15:24 UTC · model grok-4.3

classification 📊 stat.ME stat.APstat.ML
keywords nested clusteringBayesian nonparametricsingle-cell RNA-seqgenotype datavariational inferenceimmune cell typeshierarchical datapopulation-scale data
0
0 comments X

The pith

The Nested Atoms Model jointly clusters genetically similar individuals and their cells by gene expression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses clustering of nested data where observations are grouped, with variables at both the group level like individual genotypes and the observation level like cell gene expressions. Existing methods for grouped clustering omit group-level variables, limiting their ability to capture how genetic differences among people shape their cell-type profiles in large single-cell datasets. NAM is introduced as a Bayesian nonparametric model that performs simultaneous two-layered clustering of individuals and cells while using both data types. A variational Bayesian algorithm scales the model to high-dimensional data with over a million cells. When applied to the OneK1K dataset, the resulting clusters of genetically similar individuals show homogeneous cell profiles that match known immune cell types through differential gene expression analysis.

Core claim

The Nested Atoms Model is a Bayesian nonparametric approach for two-layered clustering of nested data that jointly models group-level variables such as individual-specific genotypes and observation-level variables such as cell-specific gene expressions. It defines cluster assignments at both the individual level and the cell level within individuals to capture heterogeneity across scales. A fast variational Bayesian inference procedure is developed to handle large-scale high-dimensional data. Simulations show improved performance over methods that ignore group-level information, and application to the OneK1K single-cell dataset with 982 individuals and 1.27 million cells produces cell-type–s

What carries the argument

The Nested Atoms Model, a Bayesian nonparametric model that performs simultaneous clustering of groups using group-level variables and of observations within groups using observation-level variables.

If this is right

  • NAM outperforms existing grouped clustering methods that ignore group-level variables in simulation studies.
  • The variational inference algorithm allows scaling to population-scale datasets with over a million cells.
  • Application to the OneK1K dataset yields clusters of genetically similar individuals whose cell profiles are homogeneous and align with known immune cell types.
  • The model enables investigation of how genetic variations among individuals influence differences in cell-type profiles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two-level structure could be adapted to other nested biological datasets, such as patients with multiple tissue samples.
  • Successful alignment with known cell types suggests the model might help discover new genetically influenced cell subtypes.
  • Adding covariates at either the group or observation level would extend the framework to more complex multi-omics settings.

Load-bearing premise

That jointly modeling group-level genotype variables together with observation-level gene expression will produce clusters that meaningfully capture heterogeneity at both levels and align with external biological knowledge.

What would settle it

If the cell clusters identified by NAM on the OneK1K data do not show differential gene expression patterns consistent with known immune cell types, or if simulations show no performance gain over methods that ignore group-level variables.

Figures

Figures reproduced from arXiv: 2604.11731 by Arhit Chakrabarti, Bani K. Mallick, Yang Ni, Yuchao Jiang.

Figure 1
Figure 1. Figure 1: Illustrative figure showing the inherent groups of individuals based on shared genetic variation [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of group-clustering accuracy of NAM with CAM and fiSAN against varying dimensions [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Top row: distributions of the (a) median and (b) maximum computing time with varying dimen [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Observational clustering for selected individuals by estimated group clusters (GCs). The colors [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Observational clustering for the same individuals by estimated group clusters (GCs). The colors [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Boxplot of gene expressions in the top six differentially expressed genes for selected individuals by [PITH_FULL_IMAGE:figures/full_fig_p028_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Heatmaps of the top 20 differentially expressed genes obtained from pseudobulk differential ex [PITH_FULL_IMAGE:figures/full_fig_p029_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Violin plots of the log counts per million (CPM) for the gene MS4A1, comparing expression levels [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Barplot summarizing the Gene Ontology enrichment analysis for the genes differentially expressed [PITH_FULL_IMAGE:figures/full_fig_p031_9.png] view at source ↗
read the original abstract

We consider the problem of clustering nested or hierarchical data, where observations are grouped and there are both group-level and observation-level variables. In our motivating OneK1K dataset, observations consist of single-cell RNA-sequencing (scRNA-seq) data from 982 individuals (groups), totaling 1.27 million cells (observations), along with individual-specific genotype data. This type of data would enable the identification of cell types and the investigation of how genetic variations among individuals influence differences in cell-type profiles. Our goal, therefore, is to jointly cluster cells and individuals to capture the heterogeneity across both levels using cell-specific gene expressions as well as individual-specific genotypes. However, existing grouped clustering methods do not incorporate group-level variables, thereby limiting their ability to capture the heterogeneity of genotypes in our motivating application. To address this, we propose the Nested Atoms Model (NAM), a new Bayesian nonparametric approach that enables the desired two-layered clustering, accounting for both group-level and observation-level variables. To scale NAM for high-dimensional data, we develop a fast variational Bayesian inference algorithm. Simulations show that NAM outperforms existing methods that ignore group-level variables. Applied to the OneK1K dataset, NAM identifies clusters of genetically similar individuals with homogeneous cell-type profiles. The resulting cell clusters align with known immune cell types based on differential gene expression, underscoring the ability of NAM to capture nested heterogeneity and provide biologically meaningful insights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Nested Atoms Model (NAM), a Bayesian nonparametric approach for joint two-layer clustering of nested data with group-level variables (e.g., individual genotypes) and observation-level variables (e.g., single-cell gene expressions). It develops a scalable variational Bayesian inference algorithm and applies the model to the OneK1K dataset (982 individuals, 1.27M cells). Simulations are reported to show outperformance over methods that ignore group-level variables, while the real-data analysis claims that NAM identifies clusters of genetically similar individuals with homogeneous cell-type profiles whose cell clusters align with known immune cell types via differential gene expression.

Significance. If the central claims hold, NAM would fill a methodological gap in grouped clustering by incorporating group-level covariates into a nonparametric framework, enabling joint inference on genetic and cellular heterogeneity at population scale. The variational inference procedure is a practical strength for big data. However, the significance is limited by the absence of quantitative evidence that the genotype layer contributes meaningfully to the reported biological alignments beyond what expression-only clustering would achieve.

major comments (2)
  1. [Application to OneK1K dataset] Application to OneK1K dataset (real-data results section): the claim that cell clusters 'align with known immune cell types based on differential gene expression' is supported only by post-hoc DE analysis; no quantitative metrics (adjusted Rand index against expert annotations, permutation test of marker strength, or ablation removing the genotype layer) are provided to demonstrate that the nested structure is load-bearing rather than the scRNA-seq marginal alone.
  2. [Simulation studies] Simulation studies section: while outperformance is asserted, the manuscript does not report the specific generative process used to simulate nested group- and observation-level structure or the exact metrics (e.g., ARI, clustering accuracy) and baseline implementations, making it impossible to verify that the comparison fairly isolates the benefit of incorporating group-level variables.
minor comments (2)
  1. [Abstract and Introduction] The abstract and introduction could more explicitly define the variational approximation (e.g., mean-field assumptions or ELBO terms) rather than deferring all details to the methods section.
  2. [Model specification] Notation for the nested atoms and atom assignments is introduced without a clear summary table or diagram, which would aid readability for readers unfamiliar with nested nonparametric models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment point by point below, with revisions proposed where they improve clarity and evidence without altering the manuscript's core claims.

read point-by-point responses
  1. Referee: Application to OneK1K dataset (real-data results section): the claim that cell clusters 'align with known immune cell types based on differential gene expression' is supported only by post-hoc DE analysis; no quantitative metrics (adjusted Rand index against expert annotations, permutation test of marker strength, or ablation removing the genotype layer) are provided to demonstrate that the nested structure is load-bearing rather than the scRNA-seq marginal alone.

    Authors: We agree that additional quantitative support would strengthen the demonstration that the genotype layer contributes to the observed alignments. In the revised manuscript we will add an ablation experiment fitting an expression-only variant of NAM and comparing its cell clusters to the full model via adjusted Rand index. We will also report a permutation-based test of marker-gene enrichment strength and enrichment scores against known immune cell-type markers. Full cell-level expert annotations are unavailable for the full 1.27 M cells, so direct ARI against ground truth is not feasible; the between-model ARI and permutation tests nevertheless provide quantitative evidence that the nested structure is load-bearing. revision: yes

  2. Referee: Simulation studies section: while outperformance is asserted, the manuscript does not report the specific generative process used to simulate nested group- and observation-level structure or the exact metrics (e.g., ARI, clustering accuracy) and baseline implementations, making it impossible to verify that the comparison fairly isolates the benefit of incorporating group-level variables.

    Authors: We thank the referee for highlighting the need for greater reproducibility. The simulation section already specifies a nested generative process that draws group-level genotype clusters from a Dirichlet process and then observation-level expression clusters conditional on group membership, with controlled noise levels. To make this fully verifiable we will expand the section to state the exact generative parameters, report ARI and normalized mutual information for both the group and observation layers, and list the precise baseline implementations (including package versions and hyper-parameter settings). These additions will allow readers to confirm that the reported gains isolate the benefit of the group-level variables. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in NAM derivation

full rationale

The paper introduces NAM as a novel Bayesian nonparametric model for joint clustering of nested data with group-level genotypes and observation-level expressions, develops a new variational inference algorithm to scale it, and validates via simulations (outperforming baselines that ignore group variables) plus application to OneK1K data where cell clusters align with known immune types via differential expression. No derivation step reduces by construction to fitted inputs, self-definitions, or load-bearing self-citations; the central claims rest on the new model structure and external simulation/biological checks rather than tautological renaming or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate specific free parameters, axioms, or invented entities; the model is described only at the level of a new Bayesian nonparametric approach with variational inference.

pith-pipeline@v0.9.0 · 5564 in / 1017 out tokens · 24216 ms · 2026-05-10T15:24:28.227849+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

  1. [1]

    , J,q ⋆(Sj) is aK-dimensional multinomial, withq ⋆(Sj =k) =ρ jk fork= 1,

    Forj= 1, . . . , J,q ⋆(Sj) is aK-dimensional multinomial, withq ⋆(Sj =k) =ρ jk fork= 1, . . . , K, logρ jk =g(¯ak,¯bk) + k−1X r=1 g(¯br,¯ar) + LX l=1 " njX i=1 ξjil !( g(¯alk,¯blk) + l−1X t=1 g(¯btk,¯atk) )# + + 1 2 ℓ(x,1) k +ℓ (x,2) jk , where g(x, y) =ψ(x)−ψ(x+y),withψdenoting the digamma function, ℓ(x,1) k = qX i=1 ψ((c x k −i+ 1)/2) +qlog 2 + log|D x ...

  2. [2]

    , Jandi= 1,

    Forj= 1, . . . , Jandi= 1, . . . , n j,q ⋆(Mji) is aL-dimensional multinomial, withq ⋆(Mji =l) =ξ jil forl= 1, . . . , L, logξ jil = 1 2 ℓ(y,1) l +ℓ (y,2) jil + KX k=1 ρjk ( g(¯alk,¯blk) + l−1X t=1 g(¯btk,¯atk) ) , where ℓ(y,1) l = pX i=1 ψ (cy l −i+ 1)/2 +plog 2 + log|D y l |and ℓ(y,2) jil =−p/t y l −c y l (yji −m y l )T Dy l (yji −m y l )

  3. [3]

    , Kandl= 1,

    Fork= 1, . . . , Kandl= 1, . . . , L−1,q ⋆(ulk) is aBeta(¯alk,¯blk) distribution with ¯alk = 1 + JX j=1 njX i=1 ξjit , ¯blk =r 1/r2 + JX j=1   ρjk   njX i=1 LX t=l+1 ξjit      . 41

  4. [4]

    , K−1,q ⋆(vk) is aBeta(¯ak,¯bk) distribution with ¯ak = 1 + JX j=1 ρjk , ¯bk =s 1/s2 + JX j=1 KX l=k+1 ρjl

    Fork= 1, . . . , K−1,q ⋆(vk) is aBeta(¯ak,¯bk) distribution with ¯ak = 1 + JX j=1 ρjk , ¯bk =s 1/s2 + JX j=1 KX l=k+1 ρjl

  5. [5]

    Fork= 1, . . . , K,q ⋆(µx k,Λ x k) is a NW(m x k, tx k, cx k,D x k) distribution with parameters mx k =t x−1 k (λx 0 µx 0 +N x k ¯xk), t x k =λ x 0 +N x k , c x k =ν x 0 +N x k , Dx−1 k =Ψ x−1 0 + λx 0 N x k λx 0 +N x k n (¯xk −µ x

  6. [6]

    (¯xk −µ x 0)T o +S x k, where N x k = JX j=1 ρjk ,¯x k =N x−1 k   JX j=1 ρjk xj   , Sx k = JX j=1 ρjk n (xj −¯xk) (xj −¯xk)T o

  7. [7]

    logp LY l=1 {µy l ,Λ y l } !# =LlogB(Ψ y 0, νy 0 ) + 1 2 n (νy 0 −p−1) LX l=1 ℓ(y,1) l o − 1 2 n LX l=1 cy l T(Ψ y−1 0 Dy l ) o + 1 2

    Forl= 1, . . . , L,q ⋆(µy l ,Λ y l ) is a NW(m y l , ty l , cy l ,D y l ) distribution with parameters my l =t y−1 l (λy 0 µy 0 +N y l ¯yl), t y l =λ y 0 +N y l , c y l =ν y 0 +N y l , Dy−1 l =Ψ y−1 0 + λy 0 N y l λy 0 +N y l n ¯yl −µ y 0 ¯yl −µ y 0 To +S y l , where N y l = JX j=1 njX i=1 ξjil ,¯y l =N y−1 l JX j=1 njX i=1 ξjil yji , Sy l = JX j=1 njX i=...