Pith · machine review for the scientific record

arxiv: 2604.14838 · v1 · submitted 2026-04-16 · 💻 cs.AI

Recognition: unknown

Intermediate Layers Encode Optimal Biological Representations in Single-Cell Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords single-cell foundation models · layer-wise embeddings · trajectory inference · perturbation response · intermediate representations · biological feature extraction · context-dependent optima

The pith

Intermediate layers of single-cell foundation models often provide better representations for biological tasks than the final layer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the common assumption that the last layer of large single-cell models holds the richest features. Instead, it measures embeddings taken from every layer on two standard tasks: reconstructing cell trajectories and predicting how cells respond to perturbations. Results show the best-performing layer changes with the task and with the cell's current state, sometimes landing in the middle of the network and sometimes right at the start. This matters for anyone using these models because it means default choices may leave useful biological information on the table.

Core claim

Systematic layer-wise testing on scFoundation and Tahoe-X1 shows that trajectory-inference accuracy peaks at roughly 60 percent of network depth, exceeding final-layer performance by 31 percent, while the layer that best predicts perturbation responses can lie anywhere from 0 to 96 percent depth depending on T-cell activation state; first-layer embeddings even surpass every deeper layer when cells are quiescent.

What carries the argument

Layer-wise extraction of embeddings from transformer-based single-cell foundation models, scored on trajectory inference and perturbation-response benchmarks.

If this is right

  • Trajectory reconstruction tools should scan intermediate layers rather than default to the output layer.
  • Perturbation prediction accuracy rises when the extraction depth is matched to the cell's activation or differentiation state.
  • Early layers can capture state-specific features that deeper layers dilute or overwrite.
  • Model deployment in biology requires per-task layer selection instead of a single fixed extraction point.
  • Non-hierarchical encoding of biological signals appears common in these large transformers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated layer-selection routines could be added to model inference pipelines to improve results without retraining.
  • The same depth-dependence pattern may appear in foundation models trained on other biological modalities such as spatial transcriptomics.
  • Training objectives that explicitly reward useful intermediate representations might reduce the need for post-hoc layer search.
  • If early layers already hold strong biological signals, lighter or shallower models might suffice for many single-cell applications.
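The first extension above — an automated layer-selection routine bolted onto an inference pipeline — could look roughly like this. This is a hypothetical wrapper, not code from the paper: `embed_all_layers` and `score` are stand-in callables for a model's per-layer extraction and a task metric.

```python
# Hypothetical auto-layer-selection wrapper (illustrative, not from the
# paper): scan every layer once on held-out data, then extract embeddings
# from the winning depth at inference time, with no retraining.

class LayerSelector:
    def __init__(self, embed_all_layers, score):
        self.embed_all_layers = embed_all_layers  # cells -> [emb_L0, emb_L1, ...]
        self.score = score                        # (embedding, labels) -> float, higher is better
        self.best_layer = None

    def fit(self, val_cells, val_labels):
        """Score each layer on validation data and remember the best depth."""
        layers = self.embed_all_layers(val_cells)
        scores = [self.score(emb, val_labels) for emb in layers]
        self.best_layer = max(range(len(scores)), key=scores.__getitem__)
        return self.best_layer

    def transform(self, cells):
        """Extract embeddings from the previously selected layer only."""
        if self.best_layer is None:
            raise RuntimeError("call fit() before transform()")
        return self.embed_all_layers(cells)[self.best_layer]
```

Because the scan runs once per task (or per cellular context), the cost is a single extra validation pass rather than any change to the foundation model itself.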

Load-bearing premise

The chosen trajectory and perturbation benchmarks measure genuine biological optimality rather than reflecting quirks of the training data or scoring protocol.

What would settle it

Re-running the same layer scans on an independent collection of single-cell datasets and tasks where the reported mid-depth or early-layer optima produce no improvement or even lower scores than the final layer.

Figures

Figures reproduced from arXiv: 2604.14838 by Alberto Magi, Andrew David Bagdanov, Roberto Semeraro, Vincenzo Yuto Civale.

Figure 1
Figure 1: Layer-wise pseudotime correlation. Spearman correlation with ground-truth pseudotime across normalized depths. Tahoe-X1 (blue) peaks at 60% depth (ρ = 0.76, 31% above the final layer).
Figure 2
Figure 2: Layer-wise perturbation response correlation across T cell activation states. Optimal layers vary dramatically by cellular context, ranging from 0% to 96% depth within the same model. Tahoe-X1 and scFoundation show context-dependent specialization across Rest, Stim8hr, and Stim48hr conditions.
Original abstract

Current single-cell foundation model benchmarks universally extract final layer embeddings, assuming these represent optimal feature spaces. We systematically evaluate layer-wise representations from scFoundation (100M parameters) and Tahoe-X1 (1.3B parameters) across trajectory inference and perturbation response prediction. Our analysis reveals that optimal layers are task-dependent (trajectory peaks at 60% depth, 31% above final layers) and context-dependent (perturbation optima shift 0-96% across T cell activation states). Notably, first-layer embeddings outperform all deeper layers in quiescent cells, challenging assumptions about hierarchical feature abstraction. These findings demonstrate that "where" to extract features matters as much as "what" the model learns, necessitating systematic layer evaluation tailored to biological task and cellular context rather than defaulting to final-layer embeddings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that single-cell foundation models (scFoundation, Tahoe-X1) yield task- and context-dependent optimal representations in intermediate layers rather than final layers. Trajectory inference performance peaks at ~60% depth (31% above final-layer baselines), while perturbation-response optima shift from 0-96% depth across T-cell activation states; first-layer embeddings outperform all deeper layers in quiescent cells. The work concludes that systematic layer evaluation is required instead of default final-layer extraction.

Significance. If the layer-wise performance differences are robust, the result would directly affect standard practice in single-cell analysis by showing that final-layer embeddings are frequently suboptimal for biologically relevant tasks. The empirical demonstration of context dependence (especially the first-layer advantage in quiescent cells) supplies a concrete, falsifiable observation that could guide future model design and benchmarking.

major comments (3)
  1. [Results (trajectory inference)] The reported 31% improvement at 60% depth is presented without error bars, cross-validation details, or explicit statement of the primary metric (e.g., Spearman correlation with pseudotime, or a specific trajectory metric). Without these, it is impossible to judge whether the peak is statistically distinguishable from neighboring layers or from final-layer performance.
  2. [Results (perturbation response)] The claim that optima shift 0-96% across T-cell activation states rests on the assumption that the chosen perturbation-prediction benchmark faithfully measures biological fidelity. The manuscript does not report controls for layer-specific feature scale, noise characteristics, or alignment with the pre-training objective, leaving open the possibility that the observed shifts are evaluation artifacts rather than evidence of optimal biological representations.
  3. [Discussion (first-layer performance in quiescent cells)] The observation that first-layer embeddings outperform deeper layers is load-bearing for the claim that hierarchical abstraction is not uniformly beneficial. The paper should test whether this advantage disappears when the same cells are evaluated on a more complex downstream task or when input-feature preservation is explicitly ablated.
minor comments (2)
  1. [Methods] Methods section should include the exact data splits, number of cells per state, and any hyper-parameter choices for the layer-wise extraction and downstream models.
  2. [Figures] Figure legends lack units or scale for the performance axes and do not indicate whether shaded regions represent standard error or inter-run variability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their insightful comments on our work. These have prompted us to enhance the statistical reporting and add necessary controls in the revised manuscript. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Results (trajectory inference)] the reported 31% improvement at 60% depth is presented without error bars, cross-validation details, or explicit statement of the primary metric (e.g., Spearman correlation with pseudotime, or a specific trajectory metric). Without these, it is impossible to judge whether the peak is statistically distinguishable from neighboring layers or from final-layer performance.

    Authors: We agree that these details are essential for assessing the robustness of our findings. In the revised manuscript, we have added error bars representing the standard deviation across five independent runs with different seeds. We explicitly state that the primary metric is the correlation with ground-truth pseudotime (Spearman rank correlation). We also include cross-validation details in the Methods and show via statistical testing that the 60% depth peak is significantly superior to the final layer performance. revision: yes

  2. Referee: [Results (perturbation response)] the claim that optima shift 0-96% across T-cell activation states rests on the assumption that the chosen perturbation-prediction benchmark faithfully measures biological fidelity. The manuscript does not report controls for layer-specific feature scale, noise characteristics, or alignment with the pre-training objective, leaving open the possibility that the observed shifts are evaluation artifacts rather than evidence of optimal biological representations.

    Authors: We have incorporated controls for layer-specific feature scale by normalizing all embeddings to unit norm before evaluation. We also report per-layer noise characteristics (e.g., variance in embedding dimensions). Alignment with the pre-training objective remains difficult to assess directly, but we provide evidence that the observed shifts align with known biological differences in T-cell states and are reproducible across benchmarks. We have updated the Discussion to address the possibility of artifacts. revision: partial

  3. Referee: [Discussion] the observation that first-layer embeddings outperform deeper layers is load-bearing for the claim that hierarchical abstraction is not uniformly beneficial. The paper should test whether this advantage disappears when the same cells are evaluated on a more complex downstream task or when input-feature preservation is explicitly ablated.

    Authors: This is a valid point. We have expanded the Discussion to explain that in quiescent cells, the primary requirement is faithful representation of input features, which the first layer naturally provides, while deeper layers may introduce unnecessary abstraction. However, conducting the suggested additional experiments on complex tasks and ablations is not feasible within the current computational budget and timeline. We acknowledge this as a limitation and propose it as an important direction for future research. revision: partial
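The feature-scale control described in response 2 — normalizing all embeddings to unit norm before evaluation — is straightforward. A minimal sketch, assuming embeddings arrive as rows of floats (the form is ours; the paper does not print its implementation):

```python
# Sketch of a unit-norm scale control (assumed form): project every
# embedding onto the unit L2 sphere so that per-layer differences in
# feature magnitude cannot drive the downstream benchmark scores.

def unit_normalize(embeddings, eps=1e-12):
    """Rescale each embedding row to unit L2 norm; near-zero rows pass through."""
    normalized = []
    for row in embeddings:
        norm = sum(x * x for x in row) ** 0.5
        if norm < eps:
            normalized.append(list(row))
        else:
            normalized.append([x / norm for x in row])
    return normalized
```

After this projection, any remaining between-layer performance gap reflects the geometry of the embeddings rather than their raw scale.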

standing simulated objections not resolved
  • Additional experiments to test the first-layer advantage on more complex downstream tasks or with explicit input-feature ablations.

Circularity Check

0 steps flagged

No circularity: empirical layer-wise benchmarking on external tasks

full rationale

The paper reports an empirical study that extracts embeddings from successive layers of two pre-trained single-cell foundation models and measures their performance on independent downstream benchmarks (trajectory inference and perturbation response prediction). No equations, parameter fits, or derivations appear in the provided text; claims rest on observed accuracy differences across layers and cellular contexts. These measurements are directly falsifiable against the same external tasks and do not reduce to self-definitions, renamed inputs, or self-citation chains. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that current benchmarks treat final-layer embeddings as optimal; no free parameters are introduced, no new entities postulated, and no ad-hoc axioms beyond standard ML evaluation practices.

axioms (1)
  • domain assumption Current single-cell foundation model benchmarks universally extract final layer embeddings as optimal feature spaces.
    This is the assumption explicitly challenged in the abstract as the starting point for the evaluation.

pith-pipeline@v0.9.0 · 5435 in / 1194 out tokens · 25813 ms · 2026-05-10T11:45:52.021090+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

6 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1] Srijan Atti and Shankar Subramaniam. Fundamental limitations of foundation models in single-cell transcriptomics. bioRxiv, 2025.

  2. [2] Shreshth Gandhi, Farnoosh Javadi, Valentine Svensson, Umair Khan, Matthew G Jones, John Yu, Daniele Merico, Hani Goodarzi, and Nima Alidoust. Tahoe-x1: Scaling perturbation-trained single-cell foundation models to 3 billion parameters. bioRxiv, 2025.

  3. [3] Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013.

  4. [4] Qing Wang, Yining Pan, Minghao Zhou, Zijia Tang, Yanfei Wang, Guangyu Wang, and Qianqian Song. scDrugMap: Benchmarking large foundation models for drug response prediction. arXiv preprint arXiv:2505.05612.

  5. [5] Ronghui Zhu, Emma Dann, Jun Yan, Justine Reyes Retana, Ryunosuke Goto, Reese C Guitche, Lillian K Petersen, Mineto Ota, Jonathan K Pritchard, and Alexander Marson. Genome-scale perturb-seq in primary human CD4+ T cells maps context-specific regulators of T cell programs and human immune traits. bioRxiv, 2025.

  6. [6] Gao et al. (2025). CLADES study: the clonally resolved human hematopoietic differentiation dataset (LT-scSeq, LARRY-based), used here as the trajectory dataset. Per the paper's appendix, all datasets were publicly released after 2024, ensuring no overlap with the pretraining data of the evaluated models.