ZINBGT: Exploratory Data Analysis of Single-Cell Transcriptomic Expression Using Mixture Models

Mayetri Gupta; Thomas D. Otto; Toby Kettlewell; Vincent Macaulay; Yiyi Cheng

arxiv: 2604.09319 · v1 · submitted 2026-04-10 · 📊 stat.AP

Toby Kettlewell, Yiyi Cheng, Thomas D. Otto, Vincent Macaulay, Mayetri Gupta

Pith reviewed 2026-05-10 16:31 UTC · model grok-4.3

classification 📊 stat.AP

keywords single-cell RNA-seq · mixture models · zero-inflated negative binomial · Wasserstein distance · outlier detection · exploratory data analysis · gene expression visualization · count data modeling

The pith

ZINBGT mixture model fits single-cell gene counts to yield interpretable visualizations and Wasserstein-based outlier diagnostics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ZINBGT, a new mixture model for single-cell transcriptomic counts that combines a zero-inflated negative binomial component with a geometric tail. Fitting the model to each gene produces per-gene visualizations of expression patterns across cells and supplies diagnostic plots that use Wasserstein distance to flag genes whose observed counts deviate from the fitted model. The approach is positioned as an alternative to subjective low-dimensional embeddings such as t-SNE or UMAP, supplying instead statistically grounded statements about technical noise and biological variation. Applications to a T. brucei dataset and a human immune-cell dataset illustrate how the diagnostics can surface outlier genes and relationships among sparsity, mean, and dispersion, while also exposing shortcomings in standard zero-inflated negative binomial assumptions and in current simulation procedures. This matters because noisy single-cell count data still lack validated processing pipelines, so methods that quantify model fit and highlight anomalies directly aid reliable downstream analysis.

Core claim

ZINBGT is introduced as a mixture model that captures zero inflation together with heavy-tailed count behavior in single-cell RNA data; fitting the model to each gene produces direct visualizations of its expression distribution across cells and enables Wasserstein distance calculations that identify outlier genes whose empirical distribution departs from the model expectation.

What carries the argument

ZINBGT, the zero-inflated negative binomial distribution augmented with a geometric tail, whose fitted parameters generate per-gene expression plots and serve as the reference distribution for Wasserstein distance outlier scoring.
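
The mixture described here can be written out as a probability mass function. The following is a minimal Python sketch, not the authors' implementation. It assumes that the weights (p0, p1, p2) named in the Figure 2 caption are the zero, negative-binomial and geometric weights, that the negative binomial is parametrised by a mean and a size, and that the geometric component is parametrised by its mean. The function names and the example parameter values are hypothetical.

```python
import math


def nb_pmf(k, mean, size):
    """Negative-binomial pmf on k = 0, 1, 2, ... with the given mean and size.

    The variance is mean + mean**2 / size, so a smaller size means more spread.
    """
    prob = size / (size + mean)  # success probability in the (size, prob) form
    log_pmf = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(prob) + k * math.log1p(-prob)
    )
    return math.exp(log_pmf)


def geometric_pmf(k, mean):
    """Geometric pmf on k = 0, 1, 2, ... with the given mean."""
    prob = 1.0 / (1.0 + mean)
    return prob * (1.0 - prob) ** k


def zinbgt_pmf(k, p0, p1, nb_mean, nb_size, geom_mean):
    """Assumed three-component mixture: point mass at zero, NB body, geometric tail."""
    p2 = 1.0 - p0 - p1
    if min(p0, p1, p2) < 0.0:
        raise ValueError("mixture weights must be non-negative and sum to one")
    zero_mass = p0 if k == 0 else 0.0
    return (zero_mass
            + p1 * nb_pmf(k, nb_mean, nb_size)
            + p2 * geometric_pmf(k, geom_mean))


# Hypothetical gene: mostly zeros, a low-count NB body, and a long geometric tail.
pmf = [zinbgt_pmf(k, p0=0.6, p1=0.3, nb_mean=2.0, nb_size=1.5, geom_mean=50.0)
       for k in range(5000)]
total_mass = sum(pmf)  # should be very close to 1 if the support is wide enough
```

Parametrising the geometric component by its mean keeps the tail directly comparable with the negative-binomial mean; other parametrisations would serve equally well.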

If this is right

Outlier genes become detectable in real datasets such as the T. brucei sample through Wasserstein distance summaries.
Sparsity, mean expression, and dispersion relationships can be quantified across thousands of genes in immune-cell data.
Limitations of plain zero-inflated negative binomial models for single-cell counts are made visible by systematic lack of fit.
Common simulation procedures for single-cell data can be shown to deviate from ground truth, limiting their use for method validation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same per-gene mixture-model fitting and Wasserstein scoring could be applied to other sparse count data such as microbiome or proteomics profiles.
If the diagnostics prove robust, they could serve as an automated quality-control step before clustering or differential-expression tests.
The method supplies a quantitative basis for comparing alternative preprocessing pipelines by measuring how well each pipeline’s output conforms to the ZINBGT shape.

Load-bearing premise

The ZINBGT distribution faithfully represents the combined technical and biological sources of variation in single-cell counts, so that large Wasserstein distances truly indicate meaningful outliers instead of model misspecification.

What would settle it

Re-fitting ZINBGT to a dataset containing independently validated outlier genes (or to simulated counts generated from a known different distribution) and checking whether the Wasserstein scores systematically flag or miss those known outliers would directly test the diagnostic reliability.
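
The check described above can be prototyped in a few lines. The sketch below is illustrative only: it assumes a 1-Wasserstein distance computed on the integers as the summed absolute difference between the empirical and reference CDFs, and it uses a plain geometric distribution as a stand-in reference rather than a fitted ZINBGT. The helper names, sample sizes and parameter values are hypothetical.

```python
import math
import random


def wasserstein1_counts(counts, reference_pmf, k_max):
    """1-Wasserstein distance between observed counts and a reference pmf.

    On the integers this is the sum over k of |F_emp(k) - F_ref(k)|.
    The sum stops at k_max, so reference mass beyond k_max is ignored.
    """
    if max(counts) > k_max:
        raise ValueError("k_max must be at least the largest observed count")
    freq = [0] * (k_max + 1)
    for c in counts:
        freq[c] += 1
    cdf_emp = cdf_ref = distance = 0.0
    for k in range(k_max + 1):
        cdf_emp += freq[k] / len(counts)
        cdf_ref += reference_pmf(k)
        distance += abs(cdf_emp - cdf_ref)
    return distance


def draw_geometric(rng, mean):
    """Inverse-CDF draw from a geometric distribution on 0, 1, 2, ..."""
    q = 1.0 / (1.0 + mean)
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return int(math.log(u) / math.log1p(-q))


def reference_pmf(k, mean=4.0):
    """Stand-in reference model: geometric with mean 4."""
    q = 1.0 / (1.0 + mean)
    return q * (1.0 - q) ** k


rng = random.Random(0)
matched = [draw_geometric(rng, 4.0) for _ in range(2000)]  # drawn from the reference
shifted = [draw_geometric(rng, 8.0) for _ in range(2000)]  # known different distribution

score_matched = wasserstein1_counts(matched, reference_pmf, k_max=400)
score_shifted = wasserstein1_counts(shifted, reference_pmf, k_max=400)
```

A diagnostic that behaves as intended should give the sample drawn from a different distribution a clearly larger score than the sample drawn from the reference itself.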

Figures

Figures reproduced from arXiv: 2604.09319 by Mayetri Gupta, Thomas D. Otto, Toby Kettlewell, Vincent Macaulay, Yiyi Cheng.

**Figure 2.** Ternary plot of the estimates of (p0, p1, p2) for each gene in the CD14 monocyte sample. Each small triangle represents a range of p0, p1 and consequently p2 values, with its colour representing the number of genes with parameter estimates within that range.

**Figure 3.** Two-dimensional histograms of p0 against m, d, and µg for the genes in the CD14 dataset. Cyan dot-dashed lines separate boundary values from non-boundary values. E.g., the genes above the top horizontal lines are those for which p0 is 1, whereas those just below have a p0 value close to, but strictly smaller than, 1.

**Figure 4.** Scatter plots of each gene’s mean RNA count with the Wasserstein distance between fitted …

**Figure 5.** Two-dimensional histograms comparing p0 and m for genes in the samples simulated by muscat and Hierarchicell intended to replicate the real CD14 data.

**Figure 6.** Histogram of count values observed for the outlier gene in a …

Original abstract

Single-cell transcriptomic data approximates the abundance of proteins at a high resolution, but its noisiness necessitates transformation by a pipeline of methods before analysis and inference. In the absence of robust validation of these pipelines and methods, it remains unclear how best to process any particular dataset. To compensate for this, popular visualisation methods, e.g., t-SNE and UMAP, are commonly used to produce descriptions of datasets. Such visualisations are incomplete and provide subjective descriptions of samples rather than statistically meaningful statements about technical noise or biology. In this paper, we introduce the Zero-Inflated Negative-Binomial with Geometric Tail (ZINBGT), a mixture-model-based strategy for producing interpretable visualisations of each gene's expression across cells, along with diagnostic summaries that use Wasserstein distance to highlight outlier genes. These diagnostics are used to reveal an outlier gene within a T. brucei sample. This method is applied to a human immune-cell dataset, highlighting the relationship between sparsity, mean, and spread across genes, as well as revealing an issue with the use of zero-inflated negative-binomial distributions to model single-cell RNA data. An investigation of simulated datasets intended to replicate the immune-cell data revealed discrepancies with the ground truth, establishing purposes for which these simulated datasets are unsuitable. Finally, we list a number of different domains to which this method can be applied.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ZINBGT adds a geometric tail to ZINB for per-gene scRNA modeling and uses Wasserstein distance to flag outliers, but the fit quality and biological meaning of those flags rest on limited checks.

The paper introduces ZINBGT, a zero-inflated negative binomial mixed with a geometric tail, as a way to model expression counts for each gene across cells. The authors fit it per gene, produce visualizations of the fitted distributions, and use the Wasserstein distance between the empirical and model distributions to highlight outlier genes. On a T. brucei dataset this flags one gene. On human immune-cell data it maps relationships between sparsity, mean, and spread while showing that plain ZINB models leave systematic problems. Applied to simulated data, the same approach reveals clear mismatches with the real patterns the simulations were meant to reproduce.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Zero-Inflated Negative-Binomial with Geometric Tail (ZINBGT) mixture model for exploratory analysis of single-cell transcriptomic counts. It generates per-gene visualizations of expression across cells and employs Wasserstein distance between empirical and fitted distributions as a diagnostic to flag outlier genes. Applications include identifying an outlier in a T. brucei dataset, exposing sparsity-mean-spread relationships and limitations of standard ZINB models in immune-cell data, and documenting discrepancies between simulated data and ground truth, with suggestions for use in other domains.

Significance. If the ZINBGT model provides reliable fits and the Wasserstein diagnostics isolate biologically meaningful outliers rather than artifacts, the approach could supply a statistically interpretable alternative to subjective visualizations such as t-SNE or UMAP. The explicit demonstration of ZINB shortcomings and simulation mismatches could help refine preprocessing pipelines in scRNA-seq, provided the claims rest on reproducible quantitative validation.

major comments (3)

[Abstract] The assertion that ZINBGT diagnostics reveal an issue with zero-inflated negative-binomial models for single-cell RNA data lacks accompanying goodness-of-fit metrics, likelihood comparisons, or posterior predictive checks; without these, it is impossible to separate model misspecification from genuine data features that the geometric tail is intended to address.
[Simulated datasets application] The reported discrepancies with ground truth are presented as establishing limitations of such simulations, yet the manuscript provides neither the precise simulation parameters chosen to replicate the immune-cell data nor quantitative details (e.g., which moments, zero proportions, or tail quantiles differ), rendering the claim about unsuitability difficult to evaluate or reproduce.
[T. brucei sample analysis] The Wasserstein distance is used to highlight an outlier gene under the assumption that ZINBGT faithfully captures both zero-inflation and tail behavior; given the paper's own critique of ZINB on the same data type, an explicit sensitivity analysis or comparison of fitted versus empirical tail probabilities is needed to confirm the distance reflects biology rather than residual misspecification.

minor comments (2)

[Abstract] The abstract states that the method can be applied to 'a number of different domains' but does not enumerate them; adding a brief list would strengthen the broader-impact statement.
[Methods] Notation for the ZINBGT mixture weights, success probabilities, and geometric-tail parameter should be introduced with explicit equations at first use to improve readability for readers unfamiliar with zero-inflated count models.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight areas where additional rigor can strengthen the manuscript. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract] The assertion that ZINBGT diagnostics reveal an issue with zero-inflated negative-binomial models for single-cell RNA data lacks accompanying goodness-of-fit metrics, likelihood comparisons, or posterior predictive checks; without these, it is impossible to separate model misspecification from genuine data features that the geometric tail is intended to address.

Authors: We agree that the current presentation relies primarily on visual inspection and Wasserstein distances. To better substantiate the claim of model limitations, the revised manuscript will include likelihood comparisons between fitted ZINB and ZINBGT models on the immune-cell data, along with posterior predictive checks that quantify discrepancies in zero proportions and tail behavior.

revision: yes

Referee: [Simulated datasets application] The reported discrepancies with ground truth are presented as establishing limitations of such simulations, yet the manuscript provides neither the precise simulation parameters chosen to replicate the immune-cell data nor quantitative details (e.g., which moments, zero proportions, or tail quantiles differ), rendering the claim about unsuitability difficult to evaluate or reproduce.

Authors: The manuscript describes the simulation strategy at a high level but does not tabulate the exact parameter values or provide side-by-side quantitative metrics. We will revise this section to report the full simulation parameters and include explicit comparisons of means, variances, zero-inflation rates, and upper-tail quantiles between the simulated data and the original immune-cell dataset.

revision: yes

Referee: [T. brucei sample analysis] The Wasserstein distance is used to highlight an outlier gene under the assumption that ZINBGT faithfully captures both zero-inflation and tail behavior; given the paper's own critique of ZINB on the same data type, an explicit sensitivity analysis or comparison of fitted versus empirical tail probabilities is needed to confirm the distance reflects biology rather than residual misspecification.

Authors: Although ZINBGT was constructed to improve tail capture relative to standard ZINB, we accept that an explicit check is warranted. The revision will add, for the flagged T. brucei gene, a direct overlay of empirical versus ZINBGT-fitted tail probabilities together with a sensitivity analysis that perturbs the geometric-tail parameter and recomputes the Wasserstein distance to assess robustness.

revision: yes
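
The side-by-side comparison promised in the second response (means, variances, zero-inflation rates and upper-tail quantiles) is straightforward to prototype. The sketch below is one possible way to do it for a single gene, using only the Python standard library. The function names, the choice of the 0.99 quantile and the toy count vectors are hypothetical and are not taken from the manuscript.

```python
import math
import statistics


def count_summary(counts, tail_prob=0.99):
    """Mean, variance, zero proportion and an upper-tail quantile of a count sample."""
    ordered = sorted(counts)
    # Empirical quantile: smallest value with at least tail_prob of the sample at or below it.
    rank = max(0, math.ceil(tail_prob * len(ordered)) - 1)
    return {
        "mean": statistics.fmean(ordered),
        "variance": statistics.pvariance(ordered),
        "zero_proportion": ordered.count(0) / len(ordered),
        "upper_quantile": ordered[rank],
    }


def compare_summaries(real, simulated, tail_prob=0.99):
    """Per-statistic differences, computed as simulated minus real."""
    real_stats = count_summary(real, tail_prob)
    sim_stats = count_summary(simulated, tail_prob)
    return {name: sim_stats[name] - real_stats[name] for name in real_stats}


# Toy counts for one gene: the "simulated" sample has fewer zeros and a shorter tail.
real_counts = [0, 0, 0, 1, 2, 10]
simulated_counts = [0, 1, 1, 2, 2, 3]
diff = compare_summaries(real_counts, simulated_counts)
```

Reporting signed differences per statistic, rather than a single pooled score, shows which aspect of the distribution a simulator fails to reproduce.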

Circularity Check

0 steps flagged

No circularity: ZINBGT model and Wasserstein diagnostics derive new observations from external data

The paper defines the ZINBGT mixture model as an extension for single-cell count data, fits it per gene, and computes Wasserstein distances between empirical and model distributions to flag outliers or relationships in real (T. brucei, immune-cell) and simulated datasets. No quoted derivation step reduces a reported prediction, diagnostic, or visualization to a fitted parameter by construction, nor does any central claim rest on a self-citation chain or imported uniqueness theorem. All highlighted findings (outlier gene, sparsity-mean-spread patterns, simulation discrepancies) are obtained by applying the defined procedure to independent external data, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the modeling assumption that single-cell count data follow a zero-inflated negative binomial with geometric tail and that Wasserstein distance on the fitted distributions yields biologically or technically meaningful outlier flags. Parameters of the mixture components must be estimated from each gene's data.

free parameters (1)

ZINBGT mixture parameters
Zero-inflation probability, negative-binomial dispersion and mean, and geometric tail parameter are estimated per gene from the observed counts.

axioms (1)

domain assumption: Single-cell transcriptomic expression counts for each gene are adequately described by a zero-inflated negative binomial distribution augmented with a geometric tail.
This distributional form is the core modeling choice that enables the visualizations and Wasserstein diagnostics.

pith-pipeline@v0.9.0 · 5560 in / 1391 out tokens · 70310 ms · 2026-05-10T16:31:56.194798+00:00 · methodology

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1] Crowell, H. L., Soneson, C., Germain, P.-L., et al. (2020). Muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nat. Commun., 11, …

[2] Hao, Y., Stuart, T., Kowalski, M. H., et al. (2024). Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol., 42, 293–304.
Murphy, A. E. and Skene, N. G. (2022). A balanced measure shows superior performance of pseudobulk methods in single-cell RNA-sequencing analysis. Nat. Commun., 13, …

[3] Zimmerman, K. D. and Langefeld, C. D. (2021). Hierarchicell: An R-package for estimating power for tests of differential expression with single-cell data. BMC Genomics, 22, …

[4] Zimmerman, K. D., Espeland, M. A., and Langefeld, C. D. (2021). A practical solution to pseudoreplication bias in single-cell studies. Nat. Commun., 12, …

[5] The bars to the right show the proportion of genes on each parameter’s boundaries. Between these and the bivariate histograms are univariate histograms for the non-boundary values of each parameter. Parameters generally take boundary values as a consequence of model simplification, therefore each component’s proportion of boundary values will reflect how ...
