arxiv: 2605.07938 · v1 · submitted 2026-05-08 · 💻 cs.LG

Prototype Guided Post-pretraining for Single-Cell Representation Learning

Sachini Weerasekara , Natasha Darras , Sagar Kamarthi , Colles Price , Jacqueline Isaacs

Pith reviewed 2026-05-11 02:55 UTC · model grok-4.3

keywords: single-cell representation learning, post-pretraining, marker genes, latent embeddings, foundation models, computational biology, prototype guidance, gene expression

The pith

A post-pretraining stage that uses marker-gene sets as priors refines cell embeddings and lifts downstream performance by up to 15 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Single-cell pretrained models treat genes as tokens and cells as sentences yet remain limited by long-tailed cell-type distributions and covariate shifts even after fine-tuning. CellRefine inserts an intermediate post-pretraining phase that incorporates known marker-gene sets to reshape the latent space of cell embeddings. The method delivers consistent gains on multiple computational biology tasks. This matters because more accurate cell representations can improve downstream analyses of cellular function without requiring larger pretraining corpora or heavier fine-tuning.

Core claim

CellRefine is a post-pretraining method that operates between the pretraining and fine-tuning stages of a single-cell foundation model. It employs a multi-faceted objective that incorporates marker-gene sets as structural priors to guide refinement of the latent embedding manifold of cells. Empirical results across multiple computational biology tasks show that this stage consistently improves downstream performance, yielding gains up to 15 percent.

What carries the argument

Marker-gene sets used as structural priors inside a multi-faceted post-pretraining objective that reshapes the cell latent embedding manifold.
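
The objective that carries this argument can be sketched in code. The passage quoted in the Lean-theorem section of this page gives the total loss as L_Total = L_MLM + λ1 L_prototype + λ2 L_lineage + λ3 L_GMVE, but the page does not define the individual terms. The sketch below is therefore only illustrative: `prototype_pull_loss` assumes the prototype term is a mean cosine distance between each cell embedding and a fixed prototype vector for its cell type, and the remaining terms are passed in as precomputed scalars. The function names, the cosine form, and the weight values are assumptions, not the authors' definitions.

```python
import numpy as np


def prototype_pull_loss(cell_emb, prototypes, labels):
    """Mean cosine distance between each cell and the prototype of its cell type.

    Assumed form of the prototype term; the paper's definition may differ.
    cell_emb:   (n_cells, dim) cell embeddings
    prototypes: (n_types, dim) one prototype vector per cell type
    labels:     (n_cells,) integer cell-type index for each cell
    """
    cells = cell_emb / np.linalg.norm(cell_emb, axis=1, keepdims=True)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cos = np.sum(cells * protos[labels], axis=1)  # cosine similarity per cell
    return float(np.mean(1.0 - cos))


def total_loss(l_mlm, l_prototype, l_lineage, l_gmve, lam1=1.0, lam2=1.0, lam3=1.0):
    """Weighted sum L_MLM + lam1*L_prototype + lam2*L_lineage + lam3*L_GMVE."""
    return l_mlm + lam1 * l_prototype + lam2 * l_lineage + lam3 * l_gmve


# Toy inputs for illustration only (not data from the paper).
cell_emb = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1, 0])

l_proto = prototype_pull_loss(cell_emb, prototypes, labels)
# The MLM, lineage, and GMVE values are placeholder scalars.
l_total = total_loss(1.0, l_proto, 0.5, 0.25, lam1=2.0, lam2=1.0, lam3=4.0)
print(l_proto, l_total)
```

Setting a weight to zero drops the corresponding term, which is one way an ablation over the multi-faceted objective could be organized.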

If this is right

Existing single-cell foundation models can be improved without retraining from scratch by inserting the guided post-pretraining stage.
Performance gains hold across tasks that suffer from long-tailed cell-type distributions and covariate shifts in gene expression.
The refined embeddings support more accurate downstream analyses of cellular regulatory logic.
The approach keeps the original pretraining and fine-tuning pipelines intact while adding only the intermediate refinement step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may generalize to other sequence-based biological foundation models if suitable structural priors can be identified for those domains.
Performance sensitivity to the exact choice of marker genes could be tested by swapping marker sets and measuring downstream variance.
If the gains persist on very large or noisy datasets, the technique might reduce reliance on extensive labeled data during fine-tuning.
The refinement could interact with batch-correction methods to further stabilize embeddings under strong covariate shifts.

Load-bearing premise

Marker-gene sets supply reliable structural priors that improve the embedding manifold without introducing selection bias or overfitting to the chosen genes.

What would settle it

Apply CellRefine to an existing single-cell pretrained model on a held-out dataset with established marker genes and compare against direct fine-tuning; if downstream task metrics show no improvement or a decline, the claimed benefit of the guided post-pretraining stage is falsified.
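
The comparison described above can be sketched as a paired test over matched evaluation runs. The code below assumes one downstream metric per run (for example per dataset or per seed) for direct fine-tuning and for fine-tuning after CellRefine, and estimates a bootstrap confidence interval for the mean paired difference. The function name `paired_bootstrap`, the choice of a percentile bootstrap, and the scores are all hypothetical; the scores are placeholders, not results from the paper.

```python
import numpy as np


def paired_bootstrap(baseline, refined, n_boot=2000, seed=0):
    """Mean paired difference (refined - baseline) with a 95% percentile bootstrap CI."""
    baseline = np.asarray(baseline, dtype=float)
    refined = np.asarray(refined, dtype=float)
    if baseline.shape != refined.shape:
        raise ValueError("baseline and refined must contain matched runs")
    diffs = refined - baseline
    rng = np.random.default_rng(seed)
    # Resample the paired differences with replacement.
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return float(diffs.mean()), float(lo), float(hi)


# Placeholder scores for illustration only (NOT results from the paper).
direct_ft = [0.71, 0.64, 0.80, 0.58, 0.69]
with_refine = [0.74, 0.66, 0.79, 0.65, 0.73]

mean_diff, ci_lo, ci_hi = paired_bootstrap(direct_ft, with_refine)
print(f"mean difference {mean_diff:.3f}, 95% CI [{ci_lo:.3f}, {ci_hi:.3f}]")
```

Under this reading, an interval that includes zero or lies below it would be consistent with the "no improvement or a decline" outcome, while an interval strictly above zero would support the claimed benefit on that benchmark.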

Figures

Figures reproduced from arXiv: 2605.07938 by Colles Price, Jacqueline Isaacs, Natasha Darras, Sachini Weerasekara, Sagar Kamarthi.

**Figure 1.** Long-tail distribution of the cell types in the blood cell dataset [10]. The long-tail induced extreme cell-type imbalance is a direct reflection of biological reality, where critical but rare populations, such as disease-initiating stem cells or specialized immune cells, often constitute less than 0.1% of a sample [11, 12, 13]. Our analysis confirms this trend across major public benchmarks, revealing p…

**Figure 2.** Overview of CellRefine, a post-pretraining method for single-cell foundation models. CellRefine refines the latent cell embedding space of a given foundation model on a target cell dataset before task-specific fine-tuning. The overall training setup can be written as $f_{\theta_{\mathrm{pt}}} \xrightarrow{\text{CellRefine}} f_{\theta_{\mathrm{pp}}} \xrightarrow{\text{Downstream Fine-tuning}} f_{\theta_{\mathrm{ft}}}$. 3.2 Prototype-guided Learning: The CellRefine post-pretraining stage is …

**Figure 3.** Out-of-domain zero-shot cell identity prediction performance (recall@k) on LivST1 and LivST2. We report on-domain cell identity prediction results in …

**Figure 4.** Visualization of latent embeddings distribution of tail cells in the Blood cell dataset […

**Figure 5.** Marker gene set and cell type prototype creation procedure.

**Figure 6.** Visualization of latent embeddings for cells in the Pancreas dataset […

**Figure 7.** Visualization of latent embeddings for cells in the liver dataset […

**Figure 8.** Visualization of latent embeddings for cells in the Myeloid dataset before (Figure 8a) and …

**Figure 9.** Visualization of latent embeddings for tail cells in the LivST2 dataset before (Figure 9a) …

**Figure 10.** Long-tailed distributions of single-cell datasets
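
Figure 3 reports zero-shot cell identity prediction as recall@k, but this page does not give the paper's exact definition. One common reading is the fraction of cells whose true cell type is among the k candidates most similar to the cell embedding. The sketch below implements that reading with cosine similarity against one prototype vector per cell type; the function name `recall_at_k`, the use of cosine similarity, and the toy arrays are assumptions for illustration.

```python
import numpy as np


def recall_at_k(cell_emb, prototypes, true_labels, k):
    """Fraction of cells whose true type is among the k most similar prototypes.

    Assumed reading of recall@k; the paper may define it differently.
    """
    cells = cell_emb / np.linalg.norm(cell_emb, axis=1, keepdims=True)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = cells @ protos.T                    # (n_cells, n_types) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]    # indices of the k best types per cell
    hits = (topk == np.asarray(true_labels)[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example with three cell types in a 2-D embedding space.
prototypes = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
cell_emb = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.5], [-0.1, 0.9]])
true_labels = np.array([0, 1, 1, 2])

r1 = recall_at_k(cell_emb, prototypes, true_labels, k=1)
r2 = recall_at_k(cell_emb, prototypes, true_labels, k=2)
print(r1, r2)
```

In this toy case the last two cells are ranked correctly only at k=2, so recall rises from 0.5 at k=1 to 1.0 at k=2.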

Original abstract

Single-cell representation learning (SCRL) from gene expression data offers a way to uncover the complex regulatory logic underlying cellular function. Inspired by large language models in natural language modeling, several single-cell pretrained models have recently been proposed that treat genes as tokens and cells as sentences. However, these models are fundamentally limited by the long-tailed nature of cell-type distributions and struggle to generalize under covariate shifts in gene expression data. While fine-tuning is often used to mitigate these issues, we observe that performance remains bounded. To address this challenge, we introduce CellRefine, a post-pretraining method that operates between the pretraining and fine-tuning stages of a single-cell foundation model. CellRefine uses a multi-faceted objective that incorporates marker-gene sets as structural priors to guide post-pretraining and refine the latent embedding manifold of cells. Across multiple computational biology tasks, empirical results show that CellRefine consistently improves downstream performance, yielding gains up to 15%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CellRefine adds a post-pretraining stage guided by marker-gene prototypes to single-cell models and claims up to 15% downstream gains, but the experimental details leave the leakage concern from the stress-test unresolved.

The main takeaway is that this paper proposes CellRefine as an intermediate step after pretraining but before fine-tuning. It uses a multi-faceted loss that pulls cell embeddings toward prototypes derived from marker-gene sets. That is the concrete new element, and it directly targets the long-tailed cell-type problem and covariate shift issues mentioned in the abstract. The approach is straightforward to implement on top of existing single-cell foundation models, which is a plus for anyone already running those pipelines.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CellRefine, a post-pretraining method inserted between pretraining and fine-tuning of single-cell foundation models. It employs a multi-faceted objective that incorporates marker-gene sets as structural priors to refine the latent embedding manifold of cells. The central claim is that this procedure yields consistent improvements on multiple computational biology downstream tasks, with empirical gains reaching up to 15%.

Significance. If the reported gains prove reproducible and free of leakage from marker-gene selection, the approach could supply a lightweight, targeted refinement stage that mitigates long-tailed cell-type distributions and covariate shifts without requiring full model retraining. This would be a practical addition to the single-cell representation learning toolkit.

major comments (2)

[Abstract and §4 (Experiments)] The claim of gains up to 15% is presented without any description of baselines, datasets, statistical tests, ablation studies, or cross-validation protocol, rendering the central empirical result unverifiable from the manuscript.
[§3 (Method)] The construction and provenance of the marker-gene sets are not specified, including whether the sets are held out from all downstream evaluation splits; without this, the structural-prior interpretation cannot be distinguished from possible prior leakage or selection bias.

minor comments (1)

[Abstract and §3] The abstract and method sections would benefit from explicit equations for the multi-faceted objective to clarify how the marker-gene priors enter the loss.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below and will incorporate revisions to improve the verifiability and transparency of the work.

Referee: [Abstract and §4 (Experiments)] The claim of gains up to 15% is presented without any description of baselines, datasets, statistical tests, ablation studies, or cross-validation protocol, rendering the central empirical result unverifiable from the manuscript.

Authors: We agree that the abstract presents the gains of up to 15% without sufficient context for immediate verification. We will revise the abstract to include a concise description of the evaluation protocol, key baselines, and datasets. In Section 4, we will expand the text to explicitly detail the statistical tests performed, the ablation studies on the multi-faceted objective, and the cross-validation protocol (including number of folds and runs). These changes will ensure all empirical claims are fully verifiable directly from the manuscript.

revision: yes

Referee: [§3 (Method)] The construction and provenance of the marker-gene sets are not specified, including whether the sets are held out from all downstream evaluation splits; without this, the structural-prior interpretation cannot be distinguished from possible prior leakage or selection bias.

Authors: We thank the referee for identifying this gap in methodological detail. We will revise Section 3 to specify the construction process and provenance of the marker-gene sets, drawing from established curated databases and literature sources. We will also add an explicit statement and supporting evidence that the sets are held out from all downstream evaluation splits, with no overlap or selection from the fine-tuning or test data. This will confirm that the marker genes function purely as structural priors during post-pretraining and eliminate any possibility of leakage or bias.

revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; purely empirical method

The paper introduces CellRefine as an empirical post-pretraining procedure that incorporates marker-gene sets as structural priors to refine embeddings, with performance gains demonstrated via downstream experiments. No equations, derivations, or self-referential definitions are present that would reduce any claimed result to its inputs by construction. The method description and results do not rely on fitted parameters renamed as predictions, load-bearing self-citations, or imported uniqueness theorems. The central claims remain independent and falsifiable through external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the high-level description of the new method; detailed ledger cannot be populated without full text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

CellRefine uses a multi-faceted objective that incorporates marker-gene sets as structural priors to guide post-pretraining and refine the latent embedding manifold of cells... $L_{\mathrm{Total}} = L_{\mathrm{MLM}} + \lambda_1 L_{\mathrm{prototype}} + \lambda_2 L_{\mathrm{lineage}} + \lambda_3 L_{\mathrm{GMVE}}$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

We construct marker gene sets for each cell type... ordered sequence of marker genes as a prototype... prototype-guided regularization loss

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 1 internal anchor

[1] Jan Kruta, Raphael Carapito, Marten Trendelenburg, Thierry Martin, Marta Rizzi, Reinhard E Voll, Andrea Cavalli, Eriberto Natali, Patrick Meier, Marc Stawiski, et al. Machine learning for precision diagnostics of autoimmunity. Scientific Reports, 14(1):27848, 2024
[2] Tom W Andrew, Mogdad Alrawi, Ruth Plummer, Nick Reynolds, Vern Sondak, Isaac Brownell, Penny E Lovat, Aidan Rose, and Sophia Z Shalhout. A hybrid machine learning approach for the personalized prognostication of aggressive skin cancers. npj Digital Medicine, 8(1):15, 2025
[3] Artem Shmatko, Alexander Wolfgang Jung, Kumar Gaurav, Søren Brunak, Laust Hvas Mortensen, Ewan Birney, Tom Fitzgerald, and Moritz Gerstung. Learning the natural history of human disease with generative transformers. Nature, pages 1–9, 2025
[4] Christina V Theodoris, Ling Xiao, Anant Chopra, Mark D Chaffin, Zeina R Al Sayed, Matthew C Hill, Helene Mantineo, Elizabeth M Brydon, Zexian Zeng, X Shirley Liu, et al. Transfer learning enables predictions in network biology. Nature, 618(7965):616–624, 2023
[5] Hongzhi Wen, Wenzhuo Tang, Xinnan Dai, Jiayuan Ding, Wei Jin, Yuying Xie, and Jiliang Tang. Cellplm: pre-training of cell language model beyond single cells. BioRxiv, pages 2023–10, 2023
[6] Da Kuang, Guanwen Qiu, and Junhyong Kim. Reconstructing cell lineage trees from phenotypic features with metric learning. arXiv preprint arXiv:2503.13925, 2025
[7] Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang. scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nature methods, 21(8):1470–1480, 2024
[8] Mohammad Lotfollahi, F Alexander Wolf, and Fabian J Theis. scgen predicts single-cell perturbation responses. Nature methods, 16(8):715–721, 2019
[9] Zeinab Navidi, Akshaya Thoutam, Madeline Hughes, Srivatsan Raghavan, Peter S Winter, Lorin Crawford, and Ava P Amini. Adaptive resampling for improved machine learning in imbalanced single-cell datasets. bioRxiv, pages 2025–11, 2025
[10] Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler, Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature communications, 8(1):14049, 2017
[11] Elham Azizi, Ambrose J Carr, George Plitas, Andrew E Cornish, Catherine Konopacki, Sandhya Prabhakaran, Juozas Nainys, Kenmin Wu, Vaidotas Kiseliovas, Manu Setty, et al. Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell, 174(5):1293–1308, 2018
[12] Dominic Grün, Anna Lyubimova, Lennart Kester, Kay Wiebrands, Onur Basak, Nobuo Sasaki, Hans Clevers, and Alexander Van Oudenaarden. Single-cell messenger rna sequencing reveals rare intestinal cell types. Nature, 525(7568):251–255, 2015
[13] Alexandra-Chloé Villani, Rahul Satija, Gary Reynolds, Siranush Sarkizova, Karthik Shekhar, James Fletcher, Morgane Griesbeck, Andrew Butler, Shiwei Zheng, Suzan Lazo, et al. Single-cell rna-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science, 356(6335):eaah4573, 2017
[14] Suyuan Zhao, Jiahuan Zhang, Yushuai Wu, Yizhen Luo, and Zaiqing Nie. Langcell: Language-cell pre-training for cell identity understanding. arXiv preprint arXiv:2405.06708, 2024
[15] Yusuf Roohani, Kexin Huang, and Jure Leskovec. Predicting transcriptional outcomes of novel multigene perturbations with gears. Nature Biotechnology, 42(6):927–935, 2024
[16] Qun Jiang, Shengquan Chen, Xiaoyang Chen, and Rui Jiang. scpram accurately predicts single-cell gene expression perturbation response based on attention mechanism. Bioinformatics, 40(5):btae265, 2024
[17] Xiangqin Cui and Gary A Churchill. Statistical analysis of gene expression microarray data. CRC press, 2003
[18] Sandrine Dudoit, Yee Hwa Yang, Matthew J Callow, and Terence P Speed. Statistical methods for identifying differentially expressed genes in replicated cdna microarray experiments. Statistica sinica, pages 111–139, 2002
[19] Alexander Schliep, Alexander Schönhuth, and Carsten Steinhoff. Hidden markov models for microarray time course data in multiple biological conditions. Bioinformatics, 19:i264–i272, 2003
[20] Gordon K Smyth. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical applications in genetics and molecular biology, 3(1), 2004
[21] Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98(9):5116–5121, 2001
[22] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine learning, 46(1):389–422, 2002
[23] Terrence S Furey, Nello Cristianini, Nigel Duffy, David W Bednarski, Michel Schummer, and David Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906–914, 2000
[24] Ramón Díaz-Uriarte and Sara Alvarez De Andres. Random forests for gene expression analysis. BMC bioinformatics, 7(1):1–13, 2006
[25] Benjamin A Goldstein, Alan E Hubbard, Adele Cutler, and Lisa F Barcellos. Gene selection and classification of microarray data using random forest. BMC bioinformatics, 11(1):1–13, 2010
[26] Aik Choon Tan and David Gilbert. Ensemble methods for gene expression microarray analysis. Applied bioinformatics, 2(2):75–83, 2003
[27] Mehdi Pirooznia, Jack Y Yang, Mary Qu Yang, and Youping Deng. An ensemble approach for gene expression data classification. BMC bioinformatics, 9(1):1–16, 2008
[28] Mark D Robinson and Alicia Oshlack. A scaling normalization method for differential expression analysis of rna-seq data. Genome biology, 11(3):1–9, 2010
[29] Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome biology, 15(12):1–21, 2014
[30] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, 2010
[31] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of dna-and rna-binding proteins by deep learning. Nature biotechnology, 33(8):831–838, 2015
[32] Jian Zhou and Olga G Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature methods, 12(10):931–934, 2015
[33] James Zou, Mikael Huss, Abubakar Abid, Pejman Mohammadi, Ali Torkamani, and Amalio Telenti. Deep learning for regulatory genomics. Nature biotechnology, 37(9):1082–1090, 2019
[34] Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Dimensionality reduction for visualizing single-cell data using umap. Nature biotechnology, 37(1):38–44, 2019
[35] F Alexander Wolf, Philipp Angerer, and Fabian J Theis. Scanpy: large-scale single-cell gene expression data analysis. Genome biology, 19(1):1–5, 2018
[36] Vladimir Yu Kiselev, Kristina Kirschner, Michael T Schaub, Tallulah Andrews, Andrew Yiu, Tamir Chandra, Kedar N Natarajan, Wolf Reik, Mauricio Barahona, Anthony R Green, et al. Sc3: consensus clustering of single-cell rna-seq data. Nature methods, 14(5):483–486, 2017
[37] Andrew Butler, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. Seurat: tools for single cell genomics. Nature biotechnology, 36(4):411–420, 2018
[38] Cole Trapnell, Davide Cacchiarelli, Jonna Grimsby, Prapti Pokharel, Shuqiang Li, Michael Morse, Niall J Lennon, Kenneth J Livak, Tarjei S Mikkelsen, and John L Rinn. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature biotechnology, 32(4):381–386, 2014
[39] Laleh Haghverdi, Maren Büttner, F Alexander Wolf, Florian Buettner, and Fabian J Theis. Diffusion pseudotime robustly reconstructs lineage branching. Nature methods, 13(10):845–848, 2016
[40] Ziga Avsec, Vikram Agarwal, Daniel Visentin, Joseph R Ledsam, Agnieszka Grabska-Barwinska, Kyle R Taylor, Yannis Assael, John Jumper, Pushmeet Kohli, and David R Kelley. Effective gene expression prediction from sequence by integrating long-range interactions. Nature methods, 18(10):1196–1203, 2021. URL https://www.nature.com/articles/s41592-021-01252-x
[41]

scbert as a large-scale pretrained deep language model for cell type annotation of single-cell rna-seq data.Nature Machine Intelligence, 4(10):852–866, 2022

Fan Yang, Wenchuan Wang, Fang Wang, Yuan Fang, Duyu Tang, Junzhou Huang, Hui Lu, and Jianhua Yao. scbert as a large-scale pretrained deep language model for cell type annotation of single-cell rna-seq data.Nature Machine Intelligence, 4(10):852–866, 2022

work page 2022
[42]

Deep learning in single-cell and spatial transcriptomics data analysis: advances and challenges from a data science perspective.Briefings in Bioinformatics, 2025

Yang Wang, Jiaqi Chen, Qian Li, Xuegong Wang, Jianhua Tang, and Rui Zhang. Deep learning in single-cell and spatial transcriptomics data analysis: advances and challenges from a data science perspective.Briefings in Bioinformatics, 2025. URLhttps://pmc.ncbi.nlm.nih.gov/articles/PMC11970898/

work page 2025
[43]

Integrating multi-modal information to detect spatial domains of spatial transcriptomics by graph attention network.Computers in Biology and Medicine, 2023

Yi Yang, Zunpeng Liu, Jiajia Liao, and Luonan Chen. Integrating multi-modal information to detect spatial domains of spatial transcriptomics by graph attention network.Computers in Biology and Medicine, 2023. URLhttps://www.sciencedirect.com/science/article/abs/pii/S1673852723001418

work page 2023
[44]

Cellclique: Dissecting tumor microenvironments at the single cell level using generative ai and spatial transcriptomics.Cancer Research, 85(8_Supplement_1):2418–2418, 2025

Sachini Weerasekara, Natasha Darras, Nicolas Fernandez, Melinda Chen, Alina Ainbinder, and Colles Price. Cellclique: Dissecting tumor microenvironments at the single cell level using generative ai and spatial transcriptomics.Cancer Research, 85(8_Supplement_1):2418–2418, 2025

work page 2025
[45]

Deep learning in spatially resolved transcriptomics: a comprehensive technical view.Briefings in Bioinformatics, 25(2),

Roxana Zahedi, Reza Ghamsari, Ahmadreza Argha, Callum Macphillamy, Amin Beheshti, Roohallah Alizadehsani, Nigel H Lovell, Mohammad Lotfollahi, and Hamid Alinejad-Rokny. Deep learning in spatially resolved transcriptomics: a comprehensive technical view.Briefings in Bioinformatics, 25(2),

work page
[46]

URLhttps://academic.oup.com/bib/article/25/2/bbae082/7628264

work page
[47]

Deep learning-based multimodal spatial transcriptomics analysis for cancer.Methods in Molecular Biology, 2024

Fayyaz Ahmad, Li Zhang, and Ming Chen. Deep learning-based multimodal spatial transcriptomics analysis for cancer.Methods in Molecular Biology, 2024. URL https://pmc.ncbi.nlm.nih.gov/ articles/PMC11431148/

work page 2024
[48]

Deep generative modeling for single-cell transcriptomics.Nature methods, 15(12):1053–1058, 2018

Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. Deep generative modeling for single-cell transcriptomics.Nature methods, 15(12):1053–1058, 2018

work page 2018
[49]

Joint probabilistic modeling of single-cell multi-omic data with totalvi.Nature methods, 18(3): 272–282, 2021

Adam Gayoso, Zoë Steier, Romain Lopez, Jeffrey Regier, Kristopher L Nazor, Aaron Streets, and Nir Yosef. Joint probabilistic modeling of single-cell multi-omic data with totalvi.Nature methods, 18(3): 272–282, 2021

work page 2021
[50]

A foundation model of transcription across human cell types

Alexander Karollus, Thomas Mauermeier, Maximilian Holzleitner, Johannes Lehner, Irina Poernbacher, Stefan Schoenauer, and Julien Gagneur. A foundation model of transcription across human cell types. Nature, 2024. URLhttps://www.nature.com/articles/s41586-024-08391-z

[51]

Zeyu Huang, Luis E Carvalho, Theo A Knijnenburg, Stuart J Aitken, Amaia Lujambio, Daifeng Wang, Mathieu Lupien, Anshul Kundaje, David R Kelley, and Christina S Leslie. Enhancing personalized gene expression prediction from DNA sequences using genomic foundation models. Nature Communications,

[52]

URL https://pmc.ncbi.nlm.nih.gov/articles/PMC11416237/

[53]

Alvaro Ciudad Schaar, Wajid Jawaid, Yinhan Yang, Katie Branson, Arian R Vento, Emma Dann, and Sarah A Teichmann. Nicheformer: a foundation model for single-cell and spatial omics. bioRxiv, 2024. URL https://www.biorxiv.org/content/10.1101/2024.04.15.589472v1.full

[54]

Gunsagar S Gulati, Jeremy Philip D’Silva, Yunhe Liu, Linghua Wang, and Aaron M Newman. Profiling cell identity and tissue architecture with single-cell and spatial transcriptomics. Nature Reviews Molecular Cell Biology, 26(1):11–31, 2025

[55]

Claudia Cantoni, Roman A Smirnov, Maria Firulyova, Prabhakar S Andhey, Tara R Bradstreet, Ekaterina Esaulova, Marina Terekhova, Elizabeth A Schwarzkopf, Nada M Abdalla, Maksim Kleverov, et al. A single-cell compendium of human cerebrospinal fluid identifies disease-associated immune cell populations. The Journal of Clinical Investigation, 135(1), 2025

[56]

Atefeh Lafzi, Catia Moutinho, Simone Picelli, and Holger Heyn. Tutorial: guidelines for the experimental design of single-cell RNA sequencing studies. Nature Protocols, 13(12):2742–2757, 2018

[57]

Yifan Zhang, Bingyi Kang, Bryan Hooi, Shuicheng Yan, and Jiashi Feng. Deep long-tailed learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10795–10816, 2023

[58]

Chongsheng Zhang, George Almpanidis, Gaojuan Fan, Binquan Deng, Yanbo Zhang, Ji Liu, Aouaidjia Kamel, Paolo Soda, and João Gama. A systematic review on long-tailed learning. IEEE Transactions on Neural Networks and Learning Systems, 2025

[59]

Jorge M Mendes, Aziz Barbar, and Marwa Refaie. Synthetic data generation: a privacy-preserving approach to accelerate rare disease research. Frontiers in Digital Health, 7:1563991, 2025

[60]

Shin’ya Yamaguchi, Dewei Feng, Sekitoshi Kanai, Kazuki Adachi, and Daiki Chijiwa. Post-pre-training for modality alignment in vision-language foundation models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4256–4266, 2025

[61]

Bianca Dumitrascu, Soledad Villar, Dustin G Mixon, and Barbara E Engelhardt. Optimal marker gene selection for cell type discrimination in single cell analyses. Nature Communications, 12(1):1186, 2021

[62]

Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016

[63]

Adam Gayoso, Romain Lopez, Galen Xing, Pierre Boyeau, Valeh Valiollah Pour Amiri, Justin Hong, Katherine Wu, Michael Jayasuriya, Edouard Mehlman, Maxime Langevin, et al. A Python library for probabilistic analysis of single-cell omics data. Nature Biotechnology, 40(2):163–166, 2022

[64]

Jiawei Chen, Hao Xu, Wanyu Tao, Zhaoxiong Chen, Yuxuan Zhao, and Jing-Dong J Han. Transformer for one stop interpretable cell type annotation. Nature Communications, 14(1):223, 2023

[65]

Yingxin Lin, Yue Cao, Hani Jieun Kim, Agus Salim, Terence P Speed, David M Lin, Pengyi Yang, and Jean Yee Hwa Yang. scClassify: sample size estimation and multiscale classification of cells using single and multiple reference. Molecular Systems Biology, 16(6):e9389, 2020

[66]

Sijin Cheng, Ziyi Li, Ranran Gao, Baocai Xing, Yunong Gao, Yu Yang, Shishang Qin, Lei Zhang, Hanqiang Ouyang, Peng Du, et al. A pan-cancer single-cell transcriptional atlas of tumor infiltrating myeloid cells. Cell, 184(3):792–809, 2021

[67]

Lucas Schirmer, Dmitry Velmeshev, Staffan Holmqvist, Max Kaufmann, Sebastian Werneburg, Diane Jung, Stephanie Vistnes, John H Stockley, Adam Young, Maike Steindel, et al. Neuronal vulnerability and multilineage diversity in multiple sclerosis. Nature, 573(7772):75–82, 2019

[68]

Nathan R Tucker, Mark Chaffin, Stephen J Fleming, Amelia W Hall, Victoria A Parsons, Kenneth C Bedi Jr, Amer-Denis Akkad, Caroline N Herndon, Alessandro Arduini, Irinna Papangeli, et al. Transcriptional and cellular diversity of the human heart. Circulation, 142(5):466–482, 2020

[69]

Nayoung Kim, Hong Kwan Kim, Kyungjong Lee, Yourae Hong, Jong Ho Cho, Jung Won Choi, Jung-Il Lee, Yeon-Lim Suh, Bo Mi Ku, Hye Hyeon Eum, et al. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nature Communications, 11(1):2285, 2020

[70]

10x Genomics. 10x Genomics Datasets. https://www.10xgenomics.com/datasets, 2025. Accessed: 2025-11-20

[71]

Hyun Min Kang, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, Simon Wong, Lauren Byrnes, Cristina M Lanata, et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nature Biotechnology, 36(1):89–94, 2018

[72]

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964, 2020

[73]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

[74]

Ihor Kendiukhov. Sparse autoencoders reveal organized biological knowledge but minimal regulatory logic in single-cell foundation models: a comparative atlas of Geneformer and scGPT. arXiv preprint arXiv:2603.02952, 2026

[75]

Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018

[76]

Sachini Weerasekara, Zhenyuan Lu, Burcu Ozek, Jacqueline Isaacs, and Sagar Kamarthi. Trends in adopting Industry 4.0 for asset life cycle management for sustainability: a keyword co-occurrence network review and analysis. Sustainability, 14(19):12233, 2022

[77]

Sachini Weerasekara, Wei Li, Jacqueline Isaacs, and Sagar Kamarthi. Reinforcement learning for disassembly task control. Computers & Industrial Engineering, 190:110044, 2024

[78]

Sachini Weerasekara, Wei Li, Jacqueline Isaacs, and Sagar Kamarthi. Improvements to disassembly lot sizing with task control through reinforcement learning. Journal of Advanced Manufacturing and Processing, 7(4):e70032, 2025

[79]

Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, and Jill P Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43):155...

work page doi:10.1073/pnas 2005
[80]

Arthur Liberzon, Chet Birger, Helga Thorvaldsdóttir, Mahmoud Ghandi, Jill P Mesirov, and Pablo Tamayo. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Systems, 1(6):417–425,

