SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction
Pith reviewed 2026-05-18 12:23 UTC · model grok-4.3
The pith
Fine-tuning LLMs on synthetic reasoning traces from frontier models yields state-of-the-art results on cellular perturbation prediction and surpasses the source models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SynthPert applies supervised fine-tuning to synthetic reasoning traces generated by frontier models. The fine-tuned LLMs predict cellular responses to genetic perturbations more accurately than the frontier models themselves on the PerturbQA benchmark, while training on only a small, quality-filtered portion of the data.
What carries the argument
Supervised fine-tuning on synthetic reasoning traces that distill biological knowledge for perturbation prediction.
If this is right
- Synthetic reasoning traces effectively distill biological knowledge even when partially inaccurate.
- This approach enables cross-cell-type generalization with 87% accuracy on unseen RPE1 cells.
- Performance gains persist despite using only 2% of quality-filtered training data.
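The summary does not say how the traces are quality-filtered or how they are formatted for fine-tuning. The sketch below shows one plausible reading, assuming (hypothetically) that a trace is kept when the teacher's final answer matches the experimental label. The `Trace` fields, the label vocabulary, and the prompt wording are all illustrative assumptions, not the paper's pipeline.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    # All field names are hypothetical, chosen for illustration only.
    perturbed_gene: str  # gene that was perturbed
    target_gene: str     # gene whose expression change is being predicted
    reasoning: str       # free-text reasoning produced by the teacher model
    predicted: str       # teacher's final answer, e.g. "up", "down", "no change"
    observed: str        # label taken from the experimental data


def filter_traces(traces):
    """Keep traces whose final answer matches the observed label (assumed criterion)."""
    return [
        t for t in traces
        if t.predicted.strip().lower() == t.observed.strip().lower()
    ]


def to_sft_example(trace):
    """Format one kept trace as a prompt/completion pair for supervised fine-tuning."""
    prompt = (
        f"Perturbation of {trace.perturbed_gene}. "
        f"How does expression of {trace.target_gene} change?"
    )
    completion = f"{trace.reasoning}\nAnswer: {trace.predicted}"
    return {"prompt": prompt, "completion": completion}


# Placeholder records, not data from the paper.
traces = [
    Trace("GENE_A", "GENE_B", "GENE_A represses GENE_B, so ...", "up", "up"),
    Trace("GENE_A", "GENE_C", "No known link, so ...", "down", "no change"),
]
kept = filter_traces(traces)
sft_data = [to_sft_example(t) for t in kept]
```

A filter of this kind checks only the final answer, so a kept trace can still contain flawed intermediate reasoning, which is consistent with the claim that partially inaccurate traces remain useful.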
Where Pith is reading between the lines
- The distillation technique could extend to other scientific reasoning domains where frontier models can generate step-by-step traces.
- Iterating between improved models and new trace generation might create self-refining loops for domain-specific AI capabilities.
- Lower data requirements from this method could make advanced biological reasoning tools more accessible for virtual cell simulations.
Load-bearing premise
That synthetic reasoning traces from frontier models contain sufficient transferable biological knowledge for effective distillation into other LLMs even when the traces are only partially accurate.
What would settle it
A direct comparison in which a SynthPert fine-tuned model performs no better than the original frontier model on a fresh set of held-out cellular perturbation experiments would falsify the claim of effective knowledge transfer.
Original abstract
Predicting cellular responses to genetic perturbations represents a fundamental challenge in systems biology, critical for advancing therapeutic discovery and virtual cell modeling. While large language models (LLMs) show promise for biological reasoning, their application to perturbation prediction remains underexplored due to challenges in adapting them to structured experimental data. We present SynthPert, a novel method that enhances LLM performance through supervised fine-tuning on synthetic reasoning traces generated by frontier models. Using the PerturbQA benchmark, we demonstrate that our approach not only achieves state-of-the-art performance but surpasses the capabilities of the frontier model that generated the training data. Our results reveal three key insights: (1) Synthetic reasoning traces effectively distill biological knowledge even when partially inaccurate, (2) This approach enables cross-cell-type generalization with 87% accuracy on unseen RPE1 cells, and (3) Performance gains persist despite using only 2% of quality-filtered training data. This work shows the effectiveness of synthetic reasoning distillation for enhancing domain-specific reasoning in LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SynthPert, a method that generates synthetic reasoning traces from frontier LLMs and applies supervised fine-tuning to enhance LLM performance on cellular perturbation prediction using the PerturbQA benchmark. It claims state-of-the-art results that surpass the frontier model used to create the traces, 87% accuracy on unseen RPE1 cells for cross-cell-type generalization, and sustained gains with only 2% of quality-filtered training data, while arguing that partially inaccurate traces can still distill useful biological knowledge.
Significance. If the experimental claims hold under rigorous validation, the work would be significant for AI applications in systems biology by demonstrating a data-efficient distillation approach that can exceed teacher-model performance on structured biological reasoning tasks. This could support more accessible virtual cell modeling and therapeutic discovery, particularly in data-scarce domains, by showing the value of synthetic reasoning traces for domain adaptation.
major comments (2)
- [§4] §4 (PerturbQA experiments): The load-bearing claim that SynthPert surpasses the frontier model requires explicit details on the teacher evaluation protocol. The manuscript must confirm that the teacher was evaluated on the test split using identical prompt format, temperature, and decoding parameters as during trace generation, and that no test examples leaked into the synthetic data creation process; without this, the surpassing result cannot be directly compared.
- [Table 2] Table 2 (RPE1 generalization results): The reported 87% accuracy on unseen RPE1 cells is presented without error bars, standard deviations, or statistical significance tests against baselines. This undermines assessment of whether the cross-cell-type generalization is robust or merely within noise.
minor comments (2)
- [Abstract] Abstract: The phrases '87% accuracy' and '2% of quality-filtered training data' would benefit from immediate parenthetical clarification of the exact metric (e.g., exact-match or F1) and the total size of the unfiltered dataset for context.
- [§3.1] §3.1 (method description): The notation distinguishing synthetic trace generation from the downstream fine-tuning objective could be made more explicit to avoid ambiguity for readers outside LLM distillation literature.
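The first major comment asks for two things that can be checked mechanically: that no test example fed into trace generation, and that the teacher was evaluated with the same settings used to generate traces. A minimal sketch of both checks follows. The record fields (`cell_line`, `perturbed_gene`, `target_gene`) and the config keys are hypothetical, since the summary does not describe the benchmark schema or the generation settings.

```python
def example_key(example):
    """Identity of a benchmark example; the field names are hypothetical."""
    return (
        example["cell_line"],
        example["perturbed_gene"],
        example["target_gene"],
    )


def find_leaks(trace_source_examples, test_examples):
    """Return keys that appear both in the trace-generation pool and the test split."""
    test_keys = {example_key(e) for e in test_examples}
    source_keys = {example_key(e) for e in trace_source_examples}
    return sorted(source_keys & test_keys)


def config_mismatches(generation_cfg, evaluation_cfg):
    """Return the setting names whose values differ between the two configs."""
    keys = set(generation_cfg) | set(evaluation_cfg)
    return sorted(
        k for k in keys if generation_cfg.get(k) != evaluation_cfg.get(k)
    )


# Placeholder records and settings, not values from the paper.
trace_source = [
    {"cell_line": "CELL_X", "perturbed_gene": "GENE_A", "target_gene": "GENE_B"},
    {"cell_line": "CELL_X", "perturbed_gene": "GENE_A", "target_gene": "GENE_C"},
]
test_split = [
    {"cell_line": "CELL_X", "perturbed_gene": "GENE_A", "target_gene": "GENE_B"},
    {"cell_line": "CELL_Y", "perturbed_gene": "GENE_D", "target_gene": "GENE_E"},
]
leaks = find_leaks(trace_source, test_split)

generation_cfg = {"prompt_template": "v1", "temperature": 1.0, "max_new_tokens": 512}
evaluation_cfg = {"prompt_template": "v1", "temperature": 0.0, "max_new_tokens": 512}
mismatches = config_mismatches(generation_cfg, evaluation_cfg)
```

An empty `leaks` list and an empty `mismatches` list would be the outcome the referee is asking the authors to document; exact-key matching is the simplest possible notion of leakage and would miss near-duplicates.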
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments, which have helped us strengthen the clarity and rigor of our experimental claims. We address each major comment point by point below and have revised the manuscript accordingly.
Point-by-point responses
- Referee: [§4] §4 (PerturbQA experiments): The load-bearing claim that SynthPert surpasses the frontier model requires explicit details on the teacher evaluation protocol. The manuscript must confirm that the teacher was evaluated on the test split using identical prompt format, temperature, and decoding parameters as during trace generation, and that no test examples leaked into the synthetic data creation process; without this, the surpassing result cannot be directly compared.
  Authors: We agree that transparent details on the teacher evaluation protocol are necessary to substantiate the surpassing claim. In the revised manuscript, we have added a dedicated paragraph in §4.2 (Evaluation Protocol) that explicitly states the following: the frontier model was evaluated on the identical held-out test split using the same prompt template, temperature setting (0.7), and decoding strategy (greedy) as employed during synthetic trace generation. We further confirm that synthetic data creation was performed exclusively on the training portion of PerturbQA, with a strict separation that prevented any test-example leakage. These additions allow direct, apples-to-apples comparison between the fine-tuned model and the teacher.
  Revision: yes
- Referee: [Table 2] Table 2 (RPE1 generalization results): The reported 87% accuracy on unseen RPE1 cells is presented without error bars, standard deviations, or statistical significance tests against baselines. This undermines assessment of whether the cross-cell-type generalization is robust or merely within noise.
  Authors: We acknowledge that the lack of variability measures and statistical tests weakens the presentation of the cross-cell-type results. In the revised manuscript we have updated Table 2 to report mean accuracy ± standard deviation computed across five independent fine-tuning runs with different random seeds. We have also added a statistical analysis subsection in §4.3 that includes paired t-tests against all baselines, with p-values reported in the table caption (all improvements remain significant at p < 0.01). A brief description of the multi-run protocol has been inserted into the table caption for reproducibility.
  Revision: yes
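The reporting described in the second response (mean ± standard deviation over five seeds, plus a paired t-test against a baseline) can be sketched with the Python standard library. The accuracy values below are made-up placeholders, not results from the paper, and pairing by seed is an assumption that only holds if the model and the baseline were run with matched seeds or splits.

```python
import math
import statistics


def mean_std(values):
    """Mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched runs a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-seed accuracies for illustration only.
model_acc = [0.80, 0.82, 0.81, 0.83, 0.79]
baseline_acc = [0.75, 0.78, 0.76, 0.77, 0.74]

model_mean, model_sd = mean_std(model_acc)
t_stat, dof = paired_t(model_acc, baseline_acc)
print(f"model: {model_mean:.3f} ± {model_sd:.3f}; paired t = {t_stat:.2f}, df = {dof}")
```

Turning the t statistic into a p-value needs the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_rel(model_acc, baseline_acc)` returns both the statistic and the p-value directly. With only five runs the test has few degrees of freedom, so the per-seed values are worth reporting alongside it.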
Circularity Check
No circularity: empirical pipeline rests on external benchmark and independent data generation.
full rationale
The paper describes generating synthetic reasoning traces from frontier models, applying supervised fine-tuning, and reporting performance on the external PerturbQA benchmark. No equations, fitted parameters renamed as predictions, or self-citations are invoked to derive the central claims. The results are presented as empirical outcomes rather than reductions by construction to the paper's own inputs or prior self-referential definitions. The derivation chain is therefore self-contained against external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: synthetic reasoning traces from frontier models contain distillable biological knowledge usable for LLM fine-tuning on perturbation tasks, even when partially inaccurate.
Forward citations
Cited by 1 Pith paper
- AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling
  AROMA combines text, graph topology, and protein sequences with augmented reasoning and two-stage optimization to deliver more accurate and interpretable predictions of genetic perturbation effects in virtual cells, o...