
arxiv: 2606.07760 · v1 · pith:BLSAENSYnew · submitted 2026-06-05 · 💻 cs.LG

scCBGM: Interpretable Single-Cell Counterfactual Editing

Pith reviewed 2026-06-27 22:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords single-cell RNA sequencing · counterfactual editing · concept bottleneck models · generative models · flow matching · combinatorial generalization · perturbation prediction · interpretable machine learning

The pith

scCBGM adapts concept bottleneck models with skip connections and a cross-covariance penalty to enable interpretable counterfactual editing of single cells.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces single-cell Concept Bottleneck Generative Models to edit gene expression profiles of individual cells in an interpretable manner. It modifies concept bottleneck architectures for single-cell RNA data by adding decoder skip connections and a cross-covariance penalty that keeps concepts disentangled without forcing fixed dimensions. The framework extends to flow matching models so editing works in both reconstruction and pure generation settings. Evaluation relies on a new synthetic benchmark that supplies ground-truth counterfactuals plus population-level checks on real datasets, where the method shows stronger combinatorial generalization and counterfactual accuracy than prior approaches.

Core claim

scCBGM adapts concept bottleneck generative models to single-cell RNA sequencing data through decoder skip connections and a cross-covariance penalty for disentanglement, and extends the approach to flow matching models. The result supports precise concept-guided counterfactual editing in both encoding-decoding and generation regimes, and the paper reports that it outperforms baselines on combinatorial generalization tasks.
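To make the editing step concrete, here is a minimal sketch of a concept-guided edit through a bottleneck, in the encoding-decoding regime. It is not the authors' implementation: the weights are random and untrained, the layer sizes are arbitrary, and the names `encode`, `decode` and `edit` are hypothetical. The skip connection is shown as one plausible reading, in which the bottleneck is concatenated again into the last decoder layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_known, n_unknown, n_hidden = 50, 3, 4, 16

# Random, untrained weights (assumption): a real model would learn these.
W_c = rng.normal(size=(n_genes, n_known)) * 0.1
W_u = rng.normal(size=(n_genes, n_unknown)) * 0.1
W_h = rng.normal(size=(n_known + n_unknown, n_hidden)) * 0.1
W_out = rng.normal(size=(n_hidden + n_known + n_unknown, n_genes)) * 0.1


def encode(x):
    """Map expression to known concept scores in (0, 1) and unknown latents."""
    c_hat = 1.0 / (1.0 + np.exp(-(x @ W_c)))
    u = x @ W_u
    return c_hat, u


def decode(c, u):
    """Decode from the bottleneck; z is fed to the hidden layer and,
    via a skip connection, directly to the output layer as well."""
    z = np.concatenate([c, u], axis=-1)
    h = np.tanh(z @ W_h)
    return np.concatenate([h, z], axis=-1) @ W_out


def edit(x, concept_idx, new_value):
    """Encode, overwrite one known concept, keep u fixed, decode."""
    c_hat, u = encode(x)
    c_edit = c_hat.copy()
    c_edit[:, concept_idx] = new_value
    return decode(c_edit, u)


x = rng.normal(size=(8, n_genes))
x_recon = decode(*encode(x))
x_cf = edit(x, concept_idx=0, new_value=1.0)
```

Only the chosen concept changes between `x_recon` and `x_cf`; the unknown latents `u` are reused, which is what lets the edit stay specific to one cell.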

What carries the argument

The concept bottleneck architecture adapted via decoder skip connections and a cross-covariance penalty that promotes disentanglement in the learned concept representations.
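The exact form of the penalty is not given on this page. A common way to write a cross-covariance penalty, assumed here, is the squared Frobenius norm of the batch cross-covariance between the known-concept block and the unknown block. The function name and the toy data below are illustrative only.

```python
import numpy as np


def cross_covariance_penalty(c, u):
    """Squared Frobenius norm of the batch cross-covariance between c and u.

    c: (batch, n_known) known-concept activations
    u: (batch, n_unknown) unknown-concept activations
    """
    c_centered = c - c.mean(axis=0, keepdims=True)
    u_centered = u - u.mean(axis=0, keepdims=True)
    cov = c_centered.T @ u_centered / (c.shape[0] - 1)  # (n_known, n_unknown)
    return float(np.sum(cov ** 2))


rng = np.random.default_rng(0)
c = rng.normal(size=(512, 3))
u_indep = rng.normal(size=(512, 5))      # carries no concept information
u_leaky = u_indep.copy()
u_leaky[:, 0] += 2.0 * c[:, 0]           # concept 0 leaks into the unknown block

penalty_indep = cross_covariance_penalty(c, u_indep)
penalty_leaky = cross_covariance_penalty(c, u_leaky)
```

Because the cross-covariance matrix has shape `(n_known, n_unknown)`, the two blocks may have different widths, which fits the description of a penalty that imposes no dimensional constraint. Note that a zero cross-covariance removes only linear dependence.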

If this is right

  • Concept-guided editing becomes possible in both reconstruction and unconditional generation regimes through the flow-matching extension.
  • Combinatorial generalization improves across multiple real single-cell datasets compared with prior methods.
  • Cell-level validation on synthetic data with known counterfactuals becomes feasible alongside population-level benchmarks.
  • Disentangled concepts allow separate control of distinct biological factors during editing without dimensional restrictions.
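The difference between cell-level and population-level checks can be sketched as follows. The paper's real-data metric (rMMD in the figure captions) is not defined on this page, so the example uses a plain RBF-kernel MMD with a hand-picked bandwidth as a stand-in; both function names are hypothetical.

```python
import numpy as np


def cell_level_mse(x_pred, x_true):
    """Mean squared error between paired cells; needs ground-truth pairs."""
    return float(np.mean((x_pred - x_true) ** 2))


def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between two unpaired populations."""
    def kernel(a, b):
        sq = (np.sum(a ** 2, axis=1)[:, None]
              + np.sum(b ** 2, axis=1)[None, :]
              - 2.0 * a @ b.T)
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))
    return float(kernel(x, x).mean() + kernel(y, y).mean()
                 - 2.0 * kernel(x, y).mean())


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y = rng.normal(size=(200, 5))            # same distribution as x
z = rng.normal(size=(200, 5)) + 1.0      # shifted population
x_perm = x[::-1]                         # same cells, pairing scrambled

mmd_same = rbf_mmd2(x, y, bandwidth=2.0)
mmd_shift = rbf_mmd2(x, z, bandwidth=2.0)
mmd_perm = rbf_mmd2(x, x_perm, bandwidth=2.0)
mse_perm = cell_level_mse(x, x_perm)
```

The permuted case is the instructive one: the population-level score is essentially zero even though every cell is matched to the wrong partner, while the cell-level error is large. This is why ground-truth counterfactuals are needed for per-cell claims.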

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could reduce reliance on exhaustive wet-lab perturbation screens by simulating responses to untested condition combinations.
  • Disentangled concepts might map onto known biological pathways, letting users inspect which factors drive a given edit.
  • Scaling the method to multimodal data such as paired RNA and protein measurements would test whether the same skip-connection and penalty design transfers.
  • If the cross-covariance term remains effective at larger concept counts, the framework could support finer-grained editing than dimension-constrained alternatives.

Load-bearing premise

The synthetic benchmark with ground-truth counterfactuals accurately reflects the complexities and noise of real single-cell data, and the cross-covariance penalty effectively promotes disentanglement without introducing new constraints.

What would settle it

An independent set of real perturbation experiments where the cell-level counterfactual predictions generated by scCBGM systematically diverge from the measured post-perturbation expression profiles.

Figures

Figures reproduced from arXiv: 2606.07760 by Aïcha BenTaieb, Alma Andersson, Aya Abdelsalam Ismail, Doron Haviv, Edward De Brouwer, Gabriele Scalia, Hector Corrada Bravo, Kyunghyun Cho, Tommaso Biancalani.

Figure 1: Directed Acyclic Graph (DAG) of the data-generating process. Two unobserved variables U and U_C drive X and C, respectively, with C also influencing X, consistent with C ← f_C(U_C), X ← f_X(C, U).
Figure 2: Single-cell Concept Bottleneck Generative Model Overview. Top: A concept bottleneck VAE encodes gene expression into known concepts and unknown concepts, which are subsequently decoded to reconstruct the expression profile. The model is trained to match the encoded concepts (ĉ_k) to the ground truth (c_k) while keeping the unknown concepts (u_k) independent via the cross-covariance loss. Bottom: Counter-fact…
Figure 3: Standard CBGMs vs. scCBGM under different levels of concept annotation noise. MSE between true and predicted counterfactuals across datasets (3), interventions (5), noise levels (3), and seeds (2). Results: We compare standard CBGMs with scCBGM to demonstrate the importance of our architectural modifications for single-cell data. Experiments use three synthetic datasets (20,000 cells, 5,000 genes) varying in techni…
Figure 4: Counterfactual modeling predicts cellular response to perturbation. Left: UMAP of CD4 T-cell data from the Kang et al. data, showing Control and Stimulated cells, colored by subtype. Stimulated Naive CD4 T cells were held out during training. Right: Control Naive CD4 T cells are edited in silico to predict their stimulated state. scCBGM (second panel) accurately predicts the held-out stimulated Naive CD4 T…
Figure 5: scCBGM enables interpretable control of cells to enhance response to stimulation. Stellate cells with low dosage of TCDD showed limited treatment response, while cells with high dosage showed a clear response. By editing control cells’ pathway activity profile, cells become more sensitive to TCDD, showing a treatment response similar to cells with higher TCDD dosage.
Figure 6
Figure 7: Same synthetic data as in …
Figure 8: Results from hyperparameter sweep for the scCBGM and CBGM models. Errors are cell-level MSEs, comparing the predicted counterfactual and the true counterfactual observation.
Figure 9: t-SNE of the Kang et al. dataset using the original cell type labels. Each point corresponds to a single cell, colored by its annotated cell type.
Figure 10: Qualitative and quantitative comparison of data reconstruction. UMAP visualizations of the Kang dataset (top-left) and reconstructions from five generative models. Points are colored by cell type. scCBGM and scCBGM-FM successfully reproduce the data distribution with low test-set MSE. Vanilla-FM shows poor reconstruction fidelity due to a lack of granular conditioning, while cVAE and VAE achieve moderate …
Figure 11: … where the stimulation scores shift from low to high after intervention, whereas the cell-type scores remain largely unchanged.
Figure 12: Comparison of rMMD performance between two versions of the auto-encoder backbone of scCBGM: a variational auto-encoder (VAE) and a regular auto-encoder (AE). The boxplots are computed over 4 seeds for each cell population from the Kang et al. (2017) dataset. Lower rMMD is better.
Figure 13: Comparison of scCBGM-FM (edit) and CVAE-FM (edit) on the full Cui et al. (2024) dataset. (Left) rMMD for both methods averaged per cell type over the different cytokine perturbations. (Right) Average rMMD on all cell-type-cytokine pairs.
Figure 14: Gene expression trends under in-silico perturbation in the Nault et al. (2023) dataset. The distribution of mean expression values (averaged across cells in each group) for the top 100 upregulated (green) and downregulated (orange) marker genes. The edited cells (center) successfully reproduce the target gene signatures, shifting the expression of responder and non-responder genes towards the levels observ…
Figure 15: Predicted gene expression trends match experimental ground truth. UMAP visualizations of representative genes from the top 100 differentially expressed set analyzed in …
Original abstract

Understanding cellular phenotypes and how they respond to perturbations is critical for disease biology and therapeutic design. Single-cell RNA sequencing enables characterization at cellular resolution, yet the combinatorial space of conditions makes exhaustive experimental mapping infeasible. We introduce single-cell Concept Bottleneck Generative Models (scCBGM), a framework for interpretable and precise counterfactual editing of individual cells. scCBGM adapts concept bottleneck architectures for single-cell data through decoder skip connections and a cross-covariance penalty that promotes disentanglement without dimensional constraints. We extend the framework to flow matching models, enabling concept-guided editing in both encoding-decoding and generation regimes. To enable rigorous evaluation, we develop a synthetic benchmark with ground-truth counterfactuals. Across multiple real datasets, scCBGM demonstrates superior performance in combinatorial generalization and counterfactual prediction, supported by cell-level validation on synthetic data and population-level benchmarks on real datasets.
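What "ground-truth counterfactuals" means can be illustrated with a toy structural causal model that follows the assignments in the Figure 1 caption, C ← f_C(U_C) and X ← f_X(C, U). The linear additive form of `f_X`, the binary concepts and the `effect` matrix are assumptions made for this sketch and are not the paper's generator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_concepts = 100, 20, 2

# Hypothetical concept-to-gene effect sizes.
effect = rng.normal(size=(n_concepts, n_genes))


def f_C(u_c):
    """Binary concepts driven only by their own exogenous noise."""
    return (u_c > 0).astype(float)


def f_X(c, u):
    """Expression as concept effects plus cell-specific noise."""
    return c @ effect + u


# Factual world: sample exogenous noise, then apply the assignments.
u_c = rng.normal(size=(n_cells, n_concepts))
u = rng.normal(scale=0.1, size=(n_cells, n_genes))
c = f_C(u_c)
x = f_X(c, u)

# Counterfactual world: set concept 0 to 1 for every cell and
# reuse the same cell-specific noise u.
c_cf = c.copy()
c_cf[:, 0] = 1.0
x_cf = f_X(c_cf, u)
```

Since `u` is known in simulation, `x_cf` is the exact per-cell counterfactual, which is what makes a cell-level error such as MSE computable. Real data never exposes `u`, hence the population-level benchmarks.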

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces single-cell Concept Bottleneck Generative Models (scCBGM), which adapt concept bottleneck architectures to single-cell RNA-seq via decoder skip connections and a cross-covariance penalty for disentanglement (without dimensional constraints). It extends the approach to flow matching models for concept-guided editing, constructs a synthetic benchmark with ground-truth counterfactuals, and reports superior combinatorial generalization and counterfactual prediction, supported by cell-level validation on synthetic data and population-level benchmarks on multiple real datasets.

Significance. If the central claims hold after addressing benchmark realism, the work would provide a useful interpretable framework for counterfactual editing in single-cell data, combining concept bottlenecks with generative modeling to support perturbation analysis in disease biology; the explicit synthetic benchmark with ground truth is a positive step toward rigorous evaluation in this domain.

major comments (2)
  1. §4 (Synthetic Benchmark): the generative process used to create the synthetic data with ground-truth counterfactuals is not shown to incorporate zero-inflation, high dropout rates, or batch effects typical of real scRNA-seq; without this, cell-level validation on synthetic data does not license the extrapolation that the same architecture and cross-covariance penalty will yield accurate edits on real data.
  2. §5 (Real-data Experiments): population-level metrics on real datasets cannot substitute for per-cell ground truth, so the combinatorial generalization claim rests entirely on the synthetic regime; the cross-covariance penalty's disentanglement effect is only demonstrated under the synthetic noise model and remains untested under realistic dropout and sparsity.
minor comments (2)
  1. The abstract asserts 'superior performance' without naming the exact baselines, metrics, or statistical tests; these details should be summarized in the abstract for clarity.
  2. Notation for the cross-covariance penalty term should be defined explicitly with an equation number in the methods section to allow direct comparison with related disentanglement penalties.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on the synthetic benchmark and real-data evaluation. We address each major comment below, providing clarifications and indicating planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: §4 (Synthetic Benchmark): the generative process used to create the synthetic data with ground-truth counterfactuals is not shown to incorporate zero-inflation, high dropout rates, or batch effects typical of real scRNA-seq; without this, cell-level validation on synthetic data does not license the extrapolation that the same architecture and cross-covariance penalty will yield accurate edits on real data.

    Authors: We agree that our synthetic data generation process does not explicitly incorporate zero-inflation, high dropout rates, or batch effects characteristic of real scRNA-seq data. The primary goal of the synthetic benchmark is to provide ground-truth counterfactuals for evaluating combinatorial generalization at the cell level, a capability not available in real datasets. This allows us to rigorously test the model's ability to perform precise edits under controlled conditions. While we acknowledge this simplification limits direct claims about robustness to real noise distributions, the consistent superior performance observed on multiple real datasets using population-level metrics provides supporting evidence of practical utility. In the revised manuscript, we will expand the discussion of the synthetic benchmark to explicitly state its assumptions and limitations, and suggest future work on more realistic noise models. This constitutes a partial revision. revision: partial

  2. Referee: §5 (Real-data Experiments): population-level metrics on real datasets cannot substitute for per-cell ground truth, so the combinatorial generalization claim rests entirely on the synthetic regime; the cross-covariance penalty's disentanglement effect is only demonstrated under the synthetic noise model and remains untested under realistic dropout and sparsity.

    Authors: We concur that population-level metrics on real data cannot replace per-cell ground truth, and thus the strongest evidence for combinatorial generalization comes from the synthetic benchmark. On real datasets, our evaluation follows standard practices in the field by using population-level benchmarks for counterfactual prediction tasks. For the cross-covariance penalty, its role in promoting disentanglement is indeed illustrated and quantified in the synthetic setting where the underlying factors are known. On real data, we demonstrate its benefit through improved editing performance rather than direct disentanglement metrics. In revision, we will clarify the scope of these claims in the text and add a limitations paragraph noting that disentanglement under realistic noise remains to be further investigated. This is a partial revision. revision: partial
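The "more realistic noise models" raised in this exchange could be probed with a simple stress test, sketched here under stated assumptions: counts are drawn from a Poisson with gamma-distributed rates, and technical dropout is modelled as an independent Bernoulli mask. The function `add_dropout` and all parameter values are hypothetical and are not taken from the paper's benchmark.

```python
import numpy as np


def add_dropout(counts, dropout_rate, rng):
    """Zero out each entry independently with probability dropout_rate."""
    keep = rng.random(counts.shape) >= dropout_rate
    return counts * keep


rng = np.random.default_rng(0)

# Toy count matrix: 1000 cells by 50 genes, gamma-Poisson counts.
rates = rng.gamma(shape=2.0, scale=1.0, size=(1000, 50))
clean = rng.poisson(rates)

# Same cells after simulated technical dropout.
noisy = add_dropout(clean, dropout_rate=0.6, rng=rng)

zero_fraction_clean = float(np.mean(clean == 0))
zero_fraction_noisy = float(np.mean(noisy == 0))
```

Applying such a mask to both the factual and the counterfactual synthetic cells would keep the ground truth available while testing whether cell-level accuracy survives sparsity closer to that of real scRNA-seq.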

Circularity Check

0 steps flagged

No circularity in claimed derivation or predictions

full rationale

The paper presents an empirical framework evaluated on external synthetic and real datasets, with performance claims resting on benchmark comparisons rather than any derivation that reduces to its own fitted parameters or self-citations by construction. The abstract and context describe architectural adaptations, a penalty term, and a new benchmark, but contain no load-bearing steps where a 'prediction' or result is definitionally equivalent to an input (e.g., no fitted quantities renamed as predictions, no uniqueness theorems imported from self-citations, and no ansatzes smuggled via prior work). The central claims are supported by independent validation data, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on free parameters, axioms, or invented entities; the model is described as an adaptation of existing concept bottleneck and generative modeling approaches without specifying additional postulates.


