arxiv: 2606.27752 · v1 · pith:CPZG773Hnew · submitted 2026-06-26 · 💻 cs.LG

PerturbCellRL: Verifier-Guided Reinforcement Learning for Single-Cell Perturbation Prediction

Pith reviewed 2026-06-29 05:09 UTC · model grok-4.3

classification 💻 cs.LG
keywords single-cell perturbation · reinforcement learning · verifier rewards · biological consistency · transcriptomic prediction · flow-matching generator

The pith

PerturbCellRL post-trains a flow-matching generator with four cell-level verifiers as RL rewards to improve biological consistency of individual perturbation predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PerturbCellRL as a reinforcement learning method that refines pretrained single-cell transcriptomic generators after initial training. It defines four verifiers as reward signals—Pearson top-k similarity, RMSE top-k proximity, DE Spearman, and Pathway activity—to check that each generated cell matches expected perturbation effects. On genetic and chemical benchmarks the approach lifts performance on reward-aligned metrics and a held-out metric compared with the base generator, while staying competitive with existing methods on population-level statistics. This shifts the goal from matching overall expression distributions to producing predictions whose single-cell responses receive explicit biological consistency checks.

Core claim

PerturbCellRL frames trustworthy single-cell prediction as verifier-guided generative alignment, where a pretrained flow-matching generator is post-trained via RL so that individual generated cells satisfy cell-level verifiers for Pearson similarity, RMSE proximity, differential-expression Spearman rank, and pathway activity.

What carries the argument

Reinforcement learning post-training that treats four cell-level verifiers (Pearson top-k similarity, RMSE top-k proximity, DE Spearman, Pathway activity) as reward functions to align a pretrained flow-matching generator.
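
A minimal sketch of how the two top-k rewards might be computed, assuming (as the Figure 2 caption describes) that each generated cell is compared with real target cells from the same perturbation condition. The exact formulas are not given on this page. In this sketch the Pearson reward averages the k highest correlations and the RMSE reward is the negated mean of the k smallest errors; the function names and the default `k` are hypothetical.

```python
import numpy as np

def pearson_topk_reward(generated, targets, k=5):
    """Mean of the k highest Pearson correlations between one generated
    cell (genes,) and the real target cells (cells, genes). Assumed form."""
    g = generated - generated.mean()
    t = targets - targets.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(g) * np.linalg.norm(t, axis=1) + 1e-12
    corr = (t @ g) / denom
    k = min(k, corr.shape[0])
    return float(np.sort(corr)[-k:].mean())

def rmse_topk_reward(generated, targets, k=5):
    """Negated mean of the k smallest RMSEs to real target cells, so that
    a generated cell lying closer to the target cells scores higher."""
    rmse = np.sqrt(((targets - generated) ** 2).mean(axis=1))
    k = min(k, rmse.shape[0])
    return float(-np.sort(rmse)[:k].mean())
```

Scoring against the nearest k cells rather than the condition mean is what lets distinct generated cells earn high reward without all collapsing onto a single centroid, which matches the motivation stated in the Figure 2 caption.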

If this is right

  • Improves over the pretrained flow-matching generator on reward-aligned evaluation metrics.
  • Improves on a held-out evaluation metric, which the Figure 5 caption identifies as the single-cell Discrimination Score (DS).
  • Remains competitive with state-of-the-art methods on population-level metrics.
  • Moves single-cell perturbation modeling from distribution matching toward explicit per-cell biological consistency checks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verifier-reward structure could be applied to other generative architectures beyond flow matching.
  • Pathway-activity verifiers may transfer to new perturbation classes once the relevant biology is catalogued.
  • Adding or replacing verifiers could target additional single-cell features such as cell-type specificity or temporal dynamics.

Load-bearing premise

The four verifiers accurately capture biological consistency at the single-cell level without introducing systematic bias or overlooking key response features.

What would settle it

Wet-lab experiments that measure actual transcriptional responses of cells to the same perturbations and compare them directly against the scores assigned by the four verifiers on PerturbCellRL outputs.

Figures

Figures reproduced from arXiv: 2606.27752 by Anurendra Kumar, Dongxia Wu, Emily B. Fox, Emma Lundberg, Mingyu Li, Serena Yeung-Levy, Yuhui Zhang.

Figure 1: Overview. Current single-cell perturbation generators can produce implausible individual responses. For example, a generated cell may show perturbation effects inconsistent with the known pathway direction. We design a suite of biologically meaningful verifiers serving in three roles: (1) as evaluators to assess single-cell biological consistency, (2) as reward signals to align generation via RL, and (3) a…

Figure 2: PerturbCellRL Rewards. Pearson top-k and RMSE top-k compare each generated cell with nearby real target cells from the same perturbation condition. The top-k design encourages predictions to lie near the target-cell manifold while preserving cell-level diversity, instead of collapsing all samples to a condition centroid. Pathway activity and DE Spearman evaluate pathway directionality and differential-expr…

Figure 3: PerturbCellRL algorithm. RL post-training seeks to increase the likelihood of high-reward samples and decrease the likelihood of low-reward samples. Therefore, the core training loop of PerturbCellRL consists of interleaved phases of sampling and training. (a) Sampling: we generate multiple rollouts from a fixed control expression and perturbation condition, scoring each with the reward models. (b) Traini…

Figure 4: Norman additive and holdout split proto…

Figure 5: PerturbCellRL post-training performance on Norman additive and holdout settings. We report the four proposed single-cell rewards and held-out single-cell Discrimination Score (DS) over 1600 training steps. Step 0 corresponds to the pretrained scDFM model. Implementation details. The base generator is the public scDFM checkpoint [31], used as the reference model for RL fine-tuning without retraining from sc…

Figure 6: Test-time scaling with the PROGENy pathway verifier. Best-of-N selection improves pathway reward at both the single-cell and population levels.

Figure 7: Target-fitted UMAP case studies on Norman holdout perturbations. The left, middle, and right panels show cells from the same single-gene perturbation, the same double-gene perturbation, and single-gene perturbations from the same pathway, respectively. Blue, green, and orange densities denote real target cells, scDFM predictions, and PerturbCellRL predictions, respectively.
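
The best-of-N selection mentioned in the Figure 6 caption can be illustrated with a short sketch: draw N candidate cells, score each with a verifier, and keep the top one. Everything below is a toy stand-in. `toy_sample` and `toy_reward` are hypothetical placeholders for the generator and the PROGENy pathway verifier, whose real interfaces are not shown on this page.

```python
import numpy as np

def best_of_n(sample_fn, reward_fn, n, rng):
    """Draw n candidates and return the highest-reward one with its score."""
    candidates = [sample_fn(rng) for _ in range(n)]
    rewards = np.array([reward_fn(c) for c in candidates])
    best = int(np.argmax(rewards))
    return candidates[best], float(rewards[best])

# Toy stand-ins (hypothetical): a Gaussian "generator" and a reward that
# prefers cells close to a fixed reference expression profile.
reference = np.array([1.0, -1.0, 0.5, 0.0])

def toy_sample(rng):
    return reference + rng.normal(scale=1.0, size=reference.shape)

def toy_reward(cell):
    return -float(np.sqrt(np.mean((cell - reference) ** 2)))

cell, score = best_of_n(toy_sample, toy_reward, n=16, rng=np.random.default_rng(0))
```

Because the selected score is a maximum over the drawn candidates, it cannot decrease as N grows for a fixed random stream; the cost is N generator calls and N verifier calls per kept cell.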
Original abstract

Single-cell perturbation models can reduce costly wet-lab screening by predicting how cells respond transcriptionally to interventions. While recent generative models improve population-level prediction, individual generated cells are not explicitly checked for biological consistency. We introduce PerturbCellRL, a reinforcement learning (RL) framework that post-trains a pretrained single-cell transcriptomic generator using a suite of cell-level verifiers as rewards. These verifiers define four rewards: Pearson top-k similarity, RMSE top-k proximity, DE Spearman, and Pathway activity. The Pathway activity verifier rewards cells whose pathway responses match known perturbation biology. We evaluate PerturbCellRL on multiple genetic and chemical perturbation benchmarks. Across these benchmarks, PerturbCellRL improves over the pretrained flow-matching generator on reward-aligned evaluation metrics and a held-out evaluation metric. Moreover, PerturbCellRL remains competitive with state-of-the-art methods on population-level metrics. Together, these results frame trustworthy single-cell prediction as verifier-guided generative alignment, moving beyond matching expression distributions toward predictions whose single-cell perturbation effects are explicitly checked for biological consistency.
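
To make the remaining two verifiers concrete, here is a hedged sketch of one way they could be scored. It assumes that DE Spearman is a rank correlation between a generated cell's shift from the control mean and the real mean shift, and that pathway activity is a linear footprint score (in the style of PROGENy) whose sign is compared with an expected direction. Neither definition is confirmed by this page, and all names (`de_spearman_reward`, `pathway_activity_reward`, `weights`, `expected_sign`) are hypothetical.

```python
import numpy as np

def _ranks(x):
    # Ordinal ranks; ties are broken by position (a simplification).
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks

def de_spearman_reward(generated, control_mean, target_mean):
    """Spearman correlation between predicted and observed expression shifts."""
    pred_delta = generated - control_mean
    true_delta = target_mean - control_mean
    return float(np.corrcoef(_ranks(pred_delta), _ranks(true_delta))[0, 1])

def pathway_activity_reward(generated, control_mean, weights, expected_sign):
    """Fraction of annotated pathways whose activity shift has the expected sign.

    weights: (genes, pathways) linear footprint matrix (assumed form).
    expected_sign: (pathways,) entries in {-1, 0, +1}; 0 means not annotated.
    """
    activity_shift = (generated - control_mean) @ weights
    scored = expected_sign != 0
    if not scored.any():
        return 0.0
    agree = np.sign(activity_shift[scored]) == expected_sign[scored]
    return float(agree.mean())
```

A sign-agreement score is deliberately coarse: it checks directionality only, which is the failure mode the Figure 1 caption highlights, and ignores effect magnitude.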

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PerturbCellRL, a reinforcement learning framework for post-training a pretrained flow-matching generator on single-cell transcriptomic perturbation data. Rewards are defined by four independent cell-level verifiers (Pearson top-k similarity, RMSE top-k proximity, DE Spearman correlation, and Pathway activity matching known perturbation biology). The central claim is that this verifier-guided alignment yields improvements over the base generator on both reward-aligned metrics and a held-out evaluation metric, while remaining competitive with state-of-the-art methods on population-level statistics across genetic and chemical perturbation benchmarks.

Significance. If the verifiers prove faithful, the work offers a concrete route to enforce single-cell biological consistency in generative perturbation models rather than relying solely on distributional matching. The use of external, independent verifiers is a methodological strength that avoids obvious circularity between reward and evaluation.

major comments (2)
  1. [Abstract (verifier definitions and evaluation claims)] The load-bearing claim that the four verifiers accurately capture biological consistency at the single-cell level without systematic bias is not accompanied by any ablation, sensitivity analysis, or comparison against alternative biological readouts in the provided abstract; this directly affects whether the reported gains on reward-aligned and held-out metrics can be attributed to improved biological fidelity.
  2. [Abstract (results paragraph)] No quantitative results, statistical tests, or data-split details are supplied to support the statements that PerturbCellRL 'improves over the pretrained flow-matching generator' and 'remains competitive with state-of-the-art methods'; without these, the magnitude and robustness of the claimed gains cannot be assessed.
minor comments (1)
  1. The abstract would benefit from naming the specific benchmarks and the identity of the held-out evaluation metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments on the abstract of our manuscript. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract (verifier definitions and evaluation claims)] The load-bearing claim that the four verifiers accurately capture biological consistency at the single-cell level without systematic bias is not accompanied by any ablation, sensitivity analysis, or comparison against alternative biological readouts in the provided abstract; this directly affects whether the reported gains on reward-aligned and held-out metrics can be attributed to improved biological fidelity.

    Authors: We agree that the abstract itself does not include ablations, sensitivity analyses, or comparisons to alternative readouts. The full manuscript presents these validations in Sections 4.3 (verifier design and biological grounding) and 5.2 (sensitivity and alternative readout comparisons), where we show the verifiers align with known perturbation biology without evident circularity. Due to abstract length constraints, such details are summarized rather than expanded. We will revise the abstract to include a short clause referencing the validation performed in the main text and supplement. revision: partial

  2. Referee: [Abstract (results paragraph)] No quantitative results, statistical tests, or data-split details are supplied to support the statements that PerturbCellRL 'improves over the pretrained flow-matching generator' and 'remains competitive with state-of-the-art methods'; without these, the magnitude and robustness of the claimed gains cannot be assessed.

    Authors: The abstract omits specific numbers, tests, and split details to preserve readability and emphasize the methodological framing. All quantitative results, including effect sizes, statistical tests, and data-split protocols, appear in Tables 1–3, Figure 2, and Section 3 of the main text. We will revise the abstract to incorporate one or two key quantitative statements (e.g., average improvement on held-out metric) while respecting length limits. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper describes a standard RL post-training loop that maximizes fixed external verifiers (Pearson top-k, RMSE top-k, DE Spearman, Pathway activity) on a pretrained generator. Reported gains on reward-aligned metrics are expected by construction of RL, but the central claims also include improvement on a held-out metric and competitiveness on independent population-level metrics. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are present in the provided text that reduce the result to its inputs. The verifiers are defined on external biological criteria and are not shown to be constructed from the same data or loop they evaluate.
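
The page states only that RL post-training raises the likelihood of high-reward samples and lowers that of low-reward ones, using multiple rollouts per control cell and perturbation condition (Figure 3 caption). One common way to realize this, shown here purely as an assumption and not as the paper's confirmed objective, is to standardize rewards within each rollout group and use them to weight log-likelihoods. The `log_probs` array is a placeholder; how a flow-matching generator would supply such values is not described here.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within each rollout group (rows).

    rewards: (groups, rollouts). Above-average rollouts get positive
    advantages, below-average rollouts get negative ones.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    return (rewards - mean) / (std + eps)

def weighted_log_likelihood_loss(log_probs, advantages):
    """Surrogate loss: minimizing it raises log-probability where the
    advantage is positive and lowers it where the advantage is negative."""
    return float(-(np.asarray(advantages) * np.asarray(log_probs)).mean())
```

Standardizing inside each group means only the relative ranking of rollouts for the same condition matters, so conditions with uniformly high or low verifier scores do not dominate the update.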

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unexamined assumption that the listed verifiers are valid biological proxies.

pith-pipeline@v0.9.1-grok · 5738 in / 1033 out tokens · 25625 ms · 2026-06-29T05:09:08.427969+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 12 canonical work pages · 10 internal anchors

  1. [1]

    Predicting cellular responses to perturbation across diverse contexts with state

    Abhinav K Adduri, Dhruv Gautam, Beatrice Bevilacqua, Alishba Imran, Rohan Shah, Mohsen Naghipourfar, Noam Teyssier, Rajesh Ilango, Sanjay Nagaraj, Mingze Dong, et al. Predicting cellular responses to perturbation across diverse contexts with state. bioRxiv, pages 2025–06, 2025.

  2. [2]

    Modelling cellular perturbations with the sparse additive mechanism shift variational autoencoder

    Michael Bereket and Theofanis Karaletsos. Modelling cellular perturbations with the sparse additive mechanism shift variational autoencoder. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 1–12, 2023.

  3. [3]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.

  4. [4]

    How to build the virtual cell with artificial intelligence: Priorities and opportunities

    Charlotte Bunne, Yusuf Roohani, Yanay Rosen, Ankit Gupta, Xikun Zhang, Marcel Roed, Theo Alexandrov, Mohammed AlQuraishi, Patricia Brennan, Daniel B Burkhardt, et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell, 2024.

  5. [5]

    Learning single-cell perturbation responses using neural optimal transport

    Charlotte Bunne, Stefan G Stark, Gabriele Gut, Jacobo Sarabia Del Castillo, Mitch Levesque, Kjong-Van Lehmann, Lucas Pelkmans, Andreas Krause, and Gunnar Rätsch. Learning single-cell perturbation responses using neural optimal transport. Nature methods, 20(11):1759–1768, 2023.

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

  7. [7]

    Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36:79858–79885, 2023.

  8. [8]

    Building the next generation of virtual cells to understand cellular biology

    Graham T Johnson, Eran Agmon, Matthew Akamatsu, Emma Lundberg, Blair Lyons, Wei Ouyang, Omar A Quintero-Carmona, Megan Riel-Mehan, Susanne Rafelski, and Rick Horwitz. Building the next generation of virtual cells to understand cellular biology. Biophysical Journal, 2023.

  9. [9]

    Cellflow enables generative single-cell phenotype modeling with flow matching

    Dominik Klein, Jonas Simon Fleck, Daniil Bobrovskiy, Lea Zimmermann, Sören Becker, Alessandro Palma, Leander Dony, Alejandro Tejada-Lapuerta, Guillaume Huguet, Hsiu-Chuan Lin, et al. Cellflow enables generative single-cell phenotype modeling with flow matching. bioRxiv, pages 2025–04, 2025.

  10. [10]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025.

  11. [11]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The twelfth international conference on learning representations, 2023.

  12. [12]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.

  13. [13]

    Flow Matching Guide and Code

    Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code.arXiv preprint arXiv:2412.06264,

  14. [14]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025.

  15. [15]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023.

  16. [16]

    Deep generative modeling for single-cell transcriptomics

    Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. Deep generative modeling for single-cell transcriptomics. Nature methods, 15(12):1053–1058, 2018.

  17. [17]

    Predicting cellular responses to complex perturbations in high-throughput screens

    Mohammad Lotfollahi, Anna Klimovskaia Susmelj, Carlo De Donno, Leon Hetzel, Yuge Ji, Ignacio L Ibarra, Sanjay R Srivatsan, Mohsen Naghipourfar, Riza M Daza, Beth Martin, et al. Predicting cellular responses to complex perturbations in high-throughput screens. Molecular systems biology, 19(6):MSB202211517, 2023.

  18. [18]

    Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

    Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732, 2025.

  19. [19]

    Combi-seq for multiplexed transcriptome-based profiling of drug combinations using deterministic barcoding in single-cell droplets

    Lukas Mathur, B Szalai, NH Du, Ramesh Utharala, Martine Ballinger, JJM Landry, M Ryckelynck, Vladimir Benes, Julio Saez-Rodriguez, and Christoph A Merten. Combi-seq for multiplexed transcriptome-based profiling of drug combinations using deterministic barcoding in single-cell droplets. Nature communications, 13(1):4450,

  20. [20]

    Exploring genetic interaction manifolds constructed from rich single-cell phenotypes

    Thomas M Norman, Max A Horlbeck, Joseph M Replogle, Alex Y Ge, Albert Xu, Marco Jost, Luke A Gilbert, and Jonathan S Weissman. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science, 365(6455):786–793, 2019.

  21. [21]

    scperturb: harmonized single-cell perturbation data

    Stefan Peidli, Tessa D Green, Ciyue Shen, Torsten Gross, Joseph Min, Samuele Garda, Bo Yuan, Linus J Schumacher, Jake P Taylor-King, Debora S Marks, et al. scperturb: harmonized single-cell perturbation data. Nature Methods, 21(3):531–540, 2024.

  22. [22]

    Dr.vae: improving drug response prediction via modeling of drug perturbation effects

    Ladislav Rampášek, Daniel Hidru, Petr Smirnov, Benjamin Haibe-Kains, and Anna Goldenberg. Dr.vae: improving drug response prediction via modeling of drug perturbation effects. Bioinformatics, 35(19):3743–3751, 03 2019.

  23. [23]

    Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq

    Joseph M Replogle, Reuben A Saunders, Angela N Pogson, Jeffrey A Hussmann, Alexander Lenail, Alina Guna, Lauren Mascibroda, Eric J Wagner, Karen Adelman, Gila Lithwick-Yanai, et al. Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq. Cell, 185(14):2559–2575, 2022.

  24. [24]

    Predicting transcriptional outcomes of novel multigene perturbations with gears

    Yusuf Roohani, Kexin Huang, and Jure Leskovec. Predicting transcriptional outcomes of novel multigene perturbations with gears. Nature Biotechnology, 42(6):927–935, 2024.

  25. [25]

    Virtual cell challenge: Toward a turing test for the virtual cell

    Yusuf H Roohani, Tony J Hua, Po-Yuan Tung, Lexi R Bounds, Feiqiao B Yu, Alexander Dobin, Noam Teyssier, Abhinav Adduri, Alden Woodrow, Brian S Plosky, et al. Virtual cell challenge: Toward a turing test for the virtual cell. Cell, 188(13):3370–3374, 2025.

  26. [26]

    Perturbation-response genes reveal signaling footprints in cancer gene expression

    Michael Schubert, Bertram Klinger, Martina Klünemann, Anja Sieber, Florian Uhlitz, Sascha Sauer, Mathew J Garnett, Nils Blüthgen, and Julio Saez-Rodriguez. Perturbation-response genes reveal signaling footprints in cancer gene expression. Nature communications, 9(1):20, 2018.

  27. [27]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.

  28. [28]

    Systema: a framework for evaluating genetic perturbation response prediction beyond systematic variation

    Ramon Viñas Torné, Maciej Wiatrak, Zoe Piran, Shuyang Fan, Liangze Jiang, Sarah A Teichmann, Mor Nitzan, and Maria Brbić. Systema: a framework for evaluating genetic perturbation response prediction beyond systematic variation. Nature Biotechnology, pages 1–10, 2025.

  29. [29]

    CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

    Dongxia Wu, Shiye Su, Yuhui Zhang, Elaine Sui, Emma Lundberg, Emily B Fox, and Serena Yeung-Levy. Cellfluxrl: Biologically-constrained virtual cell modeling via reinforcement learning. arXiv preprint arXiv:2603.21743, 2026.

  30. [30]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818,

  31. [31]

    scdfm: Distributional flow matching model for robust single-cell perturbation prediction

    Chenglei Yu, Chuanrui Wang, Bangyan Liao, and Tailin Wu. scdfm: Distributional flow matching model for robust single-cell perturbation prediction. arXiv preprint arXiv:2602.07103, 2026.

  32. [32]

    Cellflux: Simulating cellular morphology changes via flow matching

    Yuhui Zhang, Yuchang Su, Chenyu Wang, Tianhong Li, Zoe Wefers, Jeffrey Nirschl, James Burgess, Daisy Ding, Alejandro Lozano, Emma Lundberg, et al. Cellflux: Simulating cellular morphology changes via flow matching. arXiv preprint arXiv:2502.09775, 2025.

  33. [33]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117, 2025.

Algorithm 1 PerturbCellRL: Verifier-Guided RL for scDFM. Require: Pretrained scDFM velocit...