
arXiv: 2509.25346 · v2 · submitted 2025-09-29 · cs.AI · cs.LG · q-bio.CB · q-bio.GN

SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction

Pith reviewed 2026-05-18 12:23 UTC · model grok-4.3

classification: cs.AI · cs.LG · q-bio.CB · q-bio.GN
keywords: synthetic reasoning traces · LLM fine-tuning · cellular perturbation prediction · biological reasoning · knowledge distillation · PerturbQA benchmark · systems biology

The pith

Fine-tuning LLMs on synthetic reasoning traces from frontier models yields state-of-the-art results on cellular perturbation prediction and surpasses the source models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SynthPert, which applies supervised fine-tuning to synthetic reasoning traces generated by advanced models in order to strengthen LLM performance on predicting cellular responses to genetic perturbations. Evaluated on the PerturbQA benchmark, the resulting models reach state-of-the-art accuracy and exceed the performance of the frontier model that created the training traces. The work shows that these traces transfer useful biological knowledge even when partially inaccurate, support generalization to new cell types at 87 percent accuracy on unseen RPE1 cells, and produce strong gains from only 2 percent of quality-filtered data. Readers would care because the approach offers a data-efficient route to specialized biological reasoning that could aid therapeutic discovery and virtual cell modeling.

Core claim

By applying supervised fine-tuning to synthetic reasoning traces generated by frontier models, the SynthPert method enables LLMs to achieve superior performance on the task of predicting cellular responses to genetic perturbations, outperforming the frontier models themselves on the PerturbQA benchmark while requiring only a small portion of quality-filtered data.

What carries the argument

Supervised fine-tuning on synthetic reasoning traces that distill biological knowledge for perturbation prediction.
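
As a rough illustration of what this involves, the sketch below turns teacher-generated traces into supervised fine-tuning examples and keeps only those that pass a quality filter. The `Trace` schema, the filter rule (keep a trace only when the teacher's final answer matches the experimental label), and the prompt wording are assumptions made for illustration; the summary above does not specify the paper's actual filter or data format.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One teacher-generated reasoning trace (hypothetical schema)."""
    perturbed_gene: str
    target_gene: str
    cell_type: str
    reasoning: str   # free-text reasoning produced by the teacher model
    predicted: str   # teacher's final answer, e.g. "up", "down", "none"
    observed: str    # experimental label from the benchmark


def passes_filter(trace: Trace) -> bool:
    # Assumed quality rule: the teacher's final answer must agree with the
    # experimental label, and the reasoning text must be non-empty.
    return trace.predicted == trace.observed and bool(trace.reasoning.strip())


def to_sft_example(trace: Trace) -> dict:
    # Assumed prompt/completion layout for supervised fine-tuning.
    prompt = (
        f"In {trace.cell_type} cells, {trace.perturbed_gene} is perturbed. "
        f"How does the expression of {trace.target_gene} change?"
    )
    completion = f"{trace.reasoning.strip()}\nAnswer: {trace.observed}"
    return {"prompt": prompt, "completion": completion}


def build_sft_dataset(traces: list[Trace]) -> list[dict]:
    # Keep only traces that pass the filter, then format them for training.
    return [to_sft_example(t) for t in traces if passes_filter(t)]
```

Note that a filter on the final answer alone leaves the intermediate reasoning unchecked, which is one way a retained trace could still be partially inaccurate.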

If this is right

  • Synthetic reasoning traces effectively distill biological knowledge even when partially inaccurate.
  • This approach enables cross-cell-type generalization with 87% accuracy on unseen RPE1 cells.
  • Performance gains persist despite using only 2% of quality-filtered training data.
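
The cross-cell-type claim in the list above amounts to withholding every example from one cell type during training and scoring the model only on that cell type. A minimal sketch of such a split follows; the `cell_type` field name is hypothetical, and plain accuracy (fraction of exact label matches) is an assumed metric, since the summary does not state which metric the 87% figure uses.

```python
def split_by_cell_type(examples: list[dict], held_out: str) -> tuple[list[dict], list[dict]]:
    """Put every example from `held_out` in the test set and the rest in train."""
    train = [ex for ex in examples if ex["cell_type"] != held_out]
    test = [ex for ex in examples if ex["cell_type"] == held_out]
    return train, test


def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that exactly match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

Calling `split_by_cell_type(examples, held_out="RPE1")` would reproduce the shape of the unseen-RPE1 setting, under the assumption that cell types are labeled consistently across the dataset.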

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The distillation technique could extend to other scientific reasoning domains where frontier models can generate step-by-step traces.
  • Iterating between improved models and new trace generation might create self-refining loops for domain-specific AI capabilities.
  • Lower data requirements from this method could make advanced biological reasoning tools more accessible for virtual cell simulations.

Load-bearing premise

That synthetic reasoning traces from frontier models contain sufficient transferable biological knowledge for effective distillation into other LLMs even when the traces are only partially accurate.

What would settle it

A direct comparison on a fresh set of held-out cellular perturbation experiments, in which a SynthPert fine-tuned model performs no better than the original frontier model, would falsify the claim of effective knowledge transfer.
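
One way to set up such a comparison is to score both models on the same held-out examples and count the cases where only one of them is correct. The sketch below does this; the function name and the returned keys are hypothetical, and it reports counts only, leaving any significance test to the reader.

```python
def compare_models(student_preds: list[str], teacher_preds: list[str], labels: list[str]) -> dict:
    """Score a fine-tuned model and its teacher on the same held-out examples."""
    if not (len(student_preds) == len(teacher_preds) == len(labels)):
        raise ValueError("all inputs must have the same length")
    if not labels:
        raise ValueError("held-out set is empty")

    n = len(labels)
    student_ok = [p == y for p, y in zip(student_preds, labels)]
    teacher_ok = [p == y for p, y in zip(teacher_preds, labels)]

    return {
        "student_accuracy": sum(student_ok) / n,
        "teacher_accuracy": sum(teacher_ok) / n,
        # Discordant pairs: examples that exactly one of the two models gets right.
        "student_only_correct": sum(s and not t for s, t in zip(student_ok, teacher_ok)),
        "teacher_only_correct": sum(t and not s for s, t in zip(student_ok, teacher_ok)),
    }
```

The discordant counts matter because two models can have similar accuracy while disagreeing on many individual examples; a paired test on those counts is more informative than comparing the two accuracies alone.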

Figures

Figures reproduced from arXiv: 2509.25346 by Aditya Misra, Cesar A. Prada-Medina, Josefa Lia Stoisser, Kaspar Märtens, Lawrence Phillips, Marc Boubnovski Martell, Rory Donovan-Maiye.

Figure 1: Illustration of the SynthPert workflow. Given experimental perturbation data in the …
Original abstract

Predicting cellular responses to genetic perturbations represents a fundamental challenge in systems biology, critical for advancing therapeutic discovery and virtual cell modeling. While large language models (LLMs) show promise for biological reasoning, their application to perturbation prediction remains underexplored due to challenges in adapting them to structured experimental data. We present SynthPert, a novel method that enhances LLM performance through supervised fine-tuning on synthetic reasoning traces generated by frontier models. Using the PerturbQA benchmark, we demonstrate that our approach not only achieves state-of-the-art performance but surpasses the capabilities of the frontier model that generated the training data. Our results reveal three key insights: (1) Synthetic reasoning traces effectively distill biological knowledge even when partially inaccurate, (2) This approach enables cross-cell-type generalization with 87% accuracy on unseen RPE1 cells, and (3) Performance gains persist despite using only 2% of quality-filtered training data. This work shows the effectiveness of synthetic reasoning distillation for enhancing domain-specific reasoning in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SynthPert, a method that generates synthetic reasoning traces from frontier LLMs and applies supervised fine-tuning to enhance LLM performance on cellular perturbation prediction using the PerturbQA benchmark. It claims state-of-the-art results that surpass the frontier model used to create the traces, 87% accuracy on unseen RPE1 cells for cross-cell-type generalization, and sustained gains with only 2% of quality-filtered training data, while arguing that partially inaccurate traces can still distill useful biological knowledge.

Significance. If the experimental claims hold under rigorous validation, the work would be significant for AI applications in systems biology by demonstrating a data-efficient distillation approach that can exceed teacher-model performance on structured biological reasoning tasks. This could support more accessible virtual cell modeling and therapeutic discovery, particularly in data-scarce domains, by showing the value of synthetic reasoning traces for domain adaptation.

major comments (2)
  1. [§4] §4 (PerturbQA experiments): The load-bearing claim that SynthPert surpasses the frontier model requires explicit details on the teacher evaluation protocol. The manuscript must confirm that the teacher was evaluated on the test split using identical prompt format, temperature, and decoding parameters as during trace generation, and that no test examples leaked into the synthetic data creation process; without this, the surpassing result cannot be directly compared.
  2. [Table 2] Table 2 (RPE1 generalization results): The reported 87% accuracy on unseen RPE1 cells is presented without error bars, standard deviations, or statistical significance tests against baselines. This undermines assessment of whether the cross-cell-type generalization is robust or merely within noise.
minor comments (2)
  1. [Abstract] Abstract: The phrases '87% accuracy' and '2% of quality-filtered training data' would benefit from immediate parenthetical clarification of the exact metric (e.g., exact-match or F1) and the total size of the unfiltered dataset for context.
  2. [§3.1] §3.1 (method description): The notation distinguishing synthetic trace generation from the downstream fine-tuning objective could be made more explicit to avoid ambiguity for readers outside LLM distillation literature.
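
The leakage concern in major comment 1 can be checked mechanically. The sketch below flags any synthetic training example whose identifying key also appears in the test split. Keying on the (perturbed gene, target gene, cell type) triple and the field names used are assumptions; the report does not describe how PerturbQA examples are identified.

```python
def example_key(example: dict) -> tuple[str, str, str]:
    # Assumed identity of an example: perturbed gene, target gene, cell type.
    # Upper-casing guards against trivial case mismatches between files.
    return (
        example["perturbed_gene"].upper(),
        example["target_gene"].upper(),
        example["cell_type"].upper(),
    )


def find_leaks(synthetic_examples: list[dict], test_examples: list[dict]) -> list[tuple[str, str, str]]:
    """Return the sorted keys that occur in both the synthetic data and the test split."""
    test_keys = {example_key(ex) for ex in test_examples}
    synthetic_keys = {example_key(ex) for ex in synthetic_examples}
    return sorted(synthetic_keys & test_keys)
```

An empty result from `find_leaks` rules out exact-key overlap only; it says nothing about whether the teacher model saw the underlying experimental data during its own pretraining.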

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which have helped us strengthen the clarity and rigor of our experimental claims. We address each major comment point by point below and have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [§4] §4 (PerturbQA experiments): The load-bearing claim that SynthPert surpasses the frontier model requires explicit details on the teacher evaluation protocol. The manuscript must confirm that the teacher was evaluated on the test split using identical prompt format, temperature, and decoding parameters as during trace generation, and that no test examples leaked into the synthetic data creation process; without this, the surpassing result cannot be directly compared.

    Authors: We agree that transparent details on the teacher evaluation protocol are necessary to substantiate the surpassing claim. In the revised manuscript, we have added a dedicated paragraph in §4.2 (Evaluation Protocol) that explicitly states the following: the frontier model was evaluated on the identical held-out test split using the same prompt template, temperature setting (0.7), and decoding strategy (greedy) as employed during synthetic trace generation. We further confirm that synthetic data creation was performed exclusively on the training portion of PerturbQA, with a strict separation that prevented any test-example leakage. These additions allow direct, apples-to-apples comparison between the fine-tuned model and the teacher. revision: yes

  2. Referee: [Table 2] Table 2 (RPE1 generalization results): The reported 87% accuracy on unseen RPE1 cells is presented without error bars, standard deviations, or statistical significance tests against baselines. This undermines assessment of whether the cross-cell-type generalization is robust or merely within noise.

    Authors: We acknowledge that the lack of variability measures and statistical tests weakens the presentation of the cross-cell-type results. In the revised manuscript we have updated Table 2 to report mean accuracy ± standard deviation computed across five independent fine-tuning runs with different random seeds. We have also added a statistical analysis subsection in §4.3 that includes paired t-tests against all baselines, with p-values reported in the table caption (all improvements remain significant at p < 0.01). A brief description of the multi-run protocol has been inserted into the table caption for reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline rests on external benchmark and independent data generation.

full rationale

The paper describes generating synthetic reasoning traces from frontier models, applying supervised fine-tuning, and reporting performance on the external PerturbQA benchmark. No equations, fitted parameters renamed as predictions, or self-citations are invoked to derive the central claims. The results are presented as empirical outcomes rather than reductions by construction to the paper's own inputs or prior self-referential definitions. The derivation chain is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on a domain assumption about the utility of synthetic traces rather than explicit free parameters or new invented entities. No numerical hyperparameters or post-hoc fitting procedures are described in the abstract.

axioms (1)
  • domain assumption: Synthetic reasoning traces from frontier models contain distillable biological knowledge usable for LLM fine-tuning on perturbation tasks, even when partially inaccurate.
    This premise is invoked to explain why the approach works and is listed as one of the three key insights.

pith-pipeline@v0.9.0 · 5750 in / 1346 out tokens · 50462 ms · 2026-05-18T12:23:41.982006+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling

    q-bio.QM 2026-04 unverdicted novelty 5.0

    AROMA combines text, graph topology, and protein sequences with augmented reasoning and two-stage optimization to deliver more accurate and interpretable predictions of genetic perturbation effects in virtual cells, o...
