How Post-Training Shapes Biological Reasoning Models

Bryan Perozzi; Eric Wang; Hanlin Zhang; Lukas Fesser; Marinka Zitnik; Michelle M. Li; Sham M. Kakade; Shekoofeh Azizi

arxiv: 2606.16517 · v2 · pith:6C7TQIUPnew · submitted 2026-06-15 · 💻 cs.LG · q-bio.QM

Lukas Fesser, Hanlin Zhang, Michelle M. Li, Eric Wang, Bryan Perozzi, Shekoofeh Azizi, Sham M. Kakade, Marinka Zitnik

Pith reviewed 2026-07-01 07:46 UTC · model grok-4.3

keywords: biological reasoning models · post-training · supervised fine-tuning · reinforcement learning · generalization · in-domain out-of-domain trade-off · genomics · proteins

The pith

Biological reasoning models improve most when post-training stages are deliberately composed, with brief SFT followed by a larger RL allocation, rather than scaled uniformly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains and evaluates over 100 models across genomics, transcriptomics, and proteins while varying continued pre-training, supervised fine-tuning, and reinforcement learning. It finds that continued pre-training aligns models with biological language and lifts downstream results, supervised fine-tuning raises in-domain scores but makes out-of-domain performance peak early and then fall as the model overfits the training distribution, and reinforcement learning applied after strong supervised checkpoints recovers some lost generalization. Overall performance therefore depends on the ordering and relative size of these stages under a fixed budget, not on adding more of any single stage. This matters because it shows that the common assumption of monotonic gains from extra supervision or compute does not hold for scientific reasoning in biology.

Core claim

Biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.

What carries the argument

Controlled variation of backbone, continued pre-training, supervised fine-tuning, and reinforcement learning while separately tracking in-domain and out-of-domain performance on genomics, transcriptomics, and protein tasks.

If this is right

Continued pre-training improves downstream performance by aligning models with biological language.
Supervised fine-tuning consistently increases in-domain performance but causes out-of-domain performance to peak early and decline.
Reinforcement learning applied to strong supervised fine-tuning checkpoints with aligned rewards improves out-of-domain performance and partially recovers generalization.
Under fixed post-training budgets the strongest in-domain to out-of-domain trade-off arises from brief supervised fine-tuning, larger reinforcement learning allocations, and asymmetric adaptation capacity across stages.
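The fixed-budget comparison in the last point can be pictured as enumerating every way to divide a total number of post-training epochs between SFT and RL. The following is a minimal sketch, not the authors' code: the budget of 10 epochs, the one-epoch minimum per stage, and the names `enumerate_splits` and `rl_heavy` are assumptions made for illustration, and epochs are treated as interchangeable budget units, which the paper may not do.

```python
from typing import List, Tuple


def enumerate_splits(total_epochs: int, min_per_stage: int = 1) -> List[Tuple[int, int]]:
    """Return every (sft_epochs, rl_epochs) pair that uses the whole budget.

    Hypothetical helper: treats one SFT epoch and one RL epoch as equal
    budget units, which is a simplifying assumption.
    """
    if total_epochs < 2 * min_per_stage:
        raise ValueError("budget too small to give each stage its minimum")
    return [
        (sft, total_epochs - sft)
        for sft in range(min_per_stage, total_epochs - min_per_stage + 1)
    ]


def rl_heavy(splits: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep only the splits that allocate more epochs to RL than to SFT."""
    return [(sft, rl) for sft, rl in splits if rl > sft]


# Illustrative budget only; not a value taken from the paper.
splits = enumerate_splits(total_epochs=10)
for sft, rl in rl_heavy(splits):
    print(f"SFT epochs: {sft}, RL epochs: {rl}")
```

Each surviving pair would then be trained and scored on both ID and OOD splits, so that the trade-off is compared at equal total budget.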

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Stage-ordering effects observed here may appear when building reasoning models for other scientific domains such as chemistry.
Dynamic switching from supervised fine-tuning to reinforcement learning once out-of-domain metrics stop rising could be tested on models of different sizes.
Allocating more parameter change in later stages than earlier ones might improve results in other multimodal scientific foundation models.
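The second extension, switching from SFT to RL once out-of-domain metrics stop rising, amounts to a patience-based stopping rule on an OOD validation curve. A minimal sketch follows; the function name `switch_epoch`, the `patience` and `min_delta` parameters, and the example scores are all hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def switch_epoch(ood_scores: Sequence[float], patience: int = 2,
                 min_delta: float = 0.0) -> Optional[int]:
    """Return the 0-based epoch of the best OOD score once it has failed to
    improve by more than `min_delta` for `patience` consecutive epochs.

    Returns None if the curve is still improving when the data ends.
    """
    best_score = float("-inf")
    best_epoch = -1
    stale = 0  # epochs since the last improvement
    for epoch, score in enumerate(ood_scores):
        if score > best_score + min_delta:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch
    return None


# Made-up OOD accuracies per SFT epoch: rises, peaks at epoch 2, then falls.
example = [0.41, 0.48, 0.52, 0.50, 0.47, 0.45]
print(switch_epoch(example))
```

The returned epoch identifies the SFT checkpoint from which RL would start; a larger `patience` tolerates noisier curves at the cost of extra SFT compute.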

Load-bearing premise

The chosen in-domain and out-of-domain tasks and performance metrics accurately reflect true biological reasoning capabilities and generalization without being confounded by the specific data distributions, task designs, or evaluation protocols used in the controlled experiments.

What would settle it

Training additional models with prolonged supervised fine-tuning past the observed peak and checking whether out-of-domain scores continue to decline or instead stabilize or recover.

Figures

Figures reproduced from arXiv: 2606.16517 by Bryan Perozzi, Eric Wang, Hanlin Zhang, Lukas Fesser, Marinka Zitnik, Michelle M. Li, Sham M. Kakade, Shekoofeh Azizi.

**Figure 1.** Training dynamics define distinct generalization regimes in biological reasoning models. We compare backbone choice, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL) across genomics, transcriptomics, and protein tasks, and evaluate each stage on biologically meaningful in-domain (ID) and out-of-domain (OOD) splits.

**Figure 2.** Supervised fine-tuning improves in-domain performance but reduces out-of-domain robustness. As SFT compute increases, ID performance continues to improve, while OOD performance peaks early and declines, indicating over-specialization to the training data. DNA/RNA: mean and std. over 3 random seeds (only one seed is used for proteins, due to the size of the dataset).

**Figure 3.** Increasing data improves generalization more reliably than increasing SFT epochs. Scaling dataset size yields gains in both ID and OOD performance, but with diminishing returns, in contrast to the overfitting behavior observed when scaling epochs. These results suggest that SFT is a strong driver of in-domain biological reasoning, but that scaling it naively, either through more epochs or more data, does n…

**Figure 4.** Reinforcement learning consistently improves out-of-domain robustness. Starting from strong SFT checkpoints, RL increases both ID and OOD performance, with the largest gains in OOD and diminishing returns after the first few epochs. DNA/RNA: mean and std. over 3 random seeds (only one seed is used for proteins, due to the size of the dataset). Scaling RL epochs. We now ask whether reinforcement learning can…

**Figure 5.** Continued pre-training improves the effectiveness of downstream post-training. CPT improves both SFT and RL performance, with the largest gains appearing after RL and in out-of-domain settings. We next study whether continued pre-training changes how much downstream post-training can help. In the DNA and RNA settings, we first adapt the base backbones with continued pre-training on biological texts, yieldi…

**Figure 6.** Stronger backbones improve performance achievable with post-training but preserve training dynamics. G-R does not display an initial drop in performance when starting RL, unlike Q1-R, and generally performs better OOD. Mean and std. over 3 random seeds. To test whether our main findings depend on the choice of base model, we repeat the RNA experiments with an off-the-shelf backbone Gemma model [76]. In add…

**Figure 7.** Optimal adaptation requires asymmetric capacity across training stages. Higher LoRA rank benefits SFT, while lower rank is sufficient for RL, indicating that different stages require different adaptation capacity (both for ID and OOD tasks). Shown are results for drug target identification (RNA) tasks. We further study how adaptation capacity should be allocated across post-training stages by running a joi…
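Asymmetric capacity means the LoRA rank is chosen per stage rather than once for the whole pipeline. As a sketch of how a sweep over per-stage ranks could be laid out (the rank values, the `StageConfig` dataclass, and the grid itself are illustrative assumptions, not the configuration used in the paper):

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical per-stage adapter setting."""
    stage: str      # "sft" or "rl"
    lora_rank: int  # rank of the low-rank update matrices


def rank_grid(sft_ranks: List[int], rl_ranks: List[int]) -> List[Tuple[StageConfig, StageConfig]]:
    """Cross every SFT rank with every RL rank to form the sweep."""
    return [
        (StageConfig("sft", r_sft), StageConfig("rl", r_rl))
        for r_sft, r_rl in product(sft_ranks, rl_ranks)
    ]


# Illustrative rank values only.
grid = rank_grid(sft_ranks=[8, 32, 128], rl_ranks=[8, 32, 128])

# The asymmetric region described in the caption: higher rank for SFT than for RL.
asymmetric = [(s, r) for s, r in grid if s.lora_rank > r.lora_rank]
for sft_cfg, rl_cfg in asymmetric:
    print(f"SFT rank {sft_cfg.lora_rank} -> RL rank {rl_cfg.lora_rank}")
```

Keeping the two ranks as separate fields makes it possible to compare symmetric and asymmetric settings within the same grid.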

**Figure 8.** Under a fixed post-training budget, a small amount of SFT followed by more RL gives the best ID-OOD trade-off. Across DNA and RNA, 1–3 SFT epochs followed by larger RL budgets generally give the strongest OOD accuracy, while larger SFT allocations achieve better ID performance. Finally, we study how to allocate post-training across supervised fine-tuning and reinforcement learning. In this experiment, we …

**Figure 9.** RL shifts the ID-OOD frontier across modalities. Each point is a trained checkpoint; color denotes training stage and marker shape denotes backbone. RL generally improves OOD performance at comparable ID performance across DNA, RNA, and protein tasks.
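One way to read an ID-OOD frontier like the one in Figure 9 is as a Pareto front: a checkpoint is on the frontier if no other checkpoint is at least as good on both axes and strictly better on one. A minimal sketch, with made-up checkpoint names and scores and a hypothetical `pareto_front` helper (the paper does not state that it computes the frontier this way):

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, float]  # (id_score, ood_score), higher is better


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is at least as good as `b` on both axes and better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(checkpoints: Dict[str, Scores]) -> List[str]:
    """Return names of checkpoints not dominated by any other, sorted by ID score."""
    front = [
        name for name, s in checkpoints.items()
        if not any(dominates(other, s)
                   for other_name, other in checkpoints.items()
                   if other_name != name)
    ]
    return sorted(front, key=lambda name: checkpoints[name][0])


# Made-up scores for illustration only.
checkpoints = {
    "base": (0.40, 0.38),
    "sft_short": (0.62, 0.50),
    "sft_long": (0.75, 0.42),
    "sft_short_rl": (0.66, 0.58),
}
print(pareto_front(checkpoints))
```

In this toy data the RL checkpoint replaces its SFT starting point on the frontier, while the long-SFT checkpoint stays on it only because of its higher ID score.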

Original abstract

Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's core finding is that brief SFT followed by heavier RL gives a better ID-OOD trade-off than uniform scaling in these biological reasoning models, shown across 100+ controlled runs.

The main takeaway is that post-training for biological reasoning models is not monotonic: SFT lifts ID scores but makes OOD peak early and then drop, while RL on strong SFT checkpoints can recover some OOD performance. Under fixed budgets the best balance comes from short SFT plus larger RL allocations.

What is new is the scale and control. They varied backbone, CPT, SFT, and RL across more than 100 models on genomics, transcriptomics, and protein tasks, measuring both ID and OOD. The distinct per-stage effects and the non-monotonic pattern are concrete observations that go beyond standard fine-tuning advice.

The work does a reasonable job documenting that each stage reshapes generalization differently rather than adding uniform gains. The claim that performance depends on composition rather than total compute is supported by the controlled setup described.

The soft spot is the lack of detail on exact ID/OOD task splits and metrics in the abstract. If those splits share latent features or if the metrics mainly reward surface patterns, the reported stage interactions could be narrower than claimed. The stress-test concern about task design is worth checking in the full methods; without error bars or full tables it is also hard to judge how stable the early peaks are.

This is for people building or tuning multimodal bio foundation models who need practical rules for allocating post-training stages. A reader working on generalization in scientific domains would find the empirical patterns useful.

It deserves peer review because the experiment count is large enough to make the stage-composition claims worth detailed checking, even with the open questions on task validity.

Referee Report

1 major / 2 minor

Summary. The paper claims that post-training stages (CPT, SFT, RL) for biological reasoning models (combining LMs with multimodal bio foundation models on DNA/RNA/proteins) reshape ID and OOD generalization distinctly rather than adding uniform gains. CPT aligns models with biological language; SFT boosts ID but causes OOD to peak then decline; RL on strong SFT checkpoints with aligned rewards improves OOD and recovers generalization. Biological reasoning is non-monotonic with supervision/compute; under fixed budgets, best ID-OOD trade-offs arise from brief SFT, larger RL allocations, and asymmetric adaptation. This is supported by controlled experiments training/evaluating >100 models across genomics, transcriptomics, and proteins with variation in backbones and stages.

Significance. If the empirical results hold, the work is significant for highlighting that post-training effects on generalization in biological reasoning models are stage-specific and non-monotonic, rather than simply scaling with more data or compute. This could inform practical training strategies for such models. The scale of controlled experiments across >100 models with explicit stage variations is a clear strength, providing a systematic empirical basis for the claims about composition effects.

major comments (1)

[Abstract] Abstract and implied experimental design: The central claims—that each stage reshapes generalization distinctly, SFT induces OOD peak-and-decline, and optimal trade-offs come from brief SFT + larger RL—rest entirely on the chosen ID/OOD tasks and metrics faithfully measuring biological reasoning and generalization. No details are provided on how ID/OOD splits were constructed, whether they share latent distributional features, or how metrics were validated to isolate multi-step inference rather than surface pattern matching. This is load-bearing, as confounds in task design or data distributions could artifactually produce the reported non-monotonicity and stage-composition effects.

minor comments (2)

[Abstract] The abstract and results lack reporting of error bars, exact dataset sizes, full results tables, and precise metric definitions, which are needed to assess reproducibility and support the claims about distinct per-stage effects.
Notation for ID/OOD performance and stage allocations could be clarified with explicit definitions or a table summarizing the controlled variations across the >100 models.
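The error bars asked for in the first minor comment would typically be a mean and standard deviation across seeds at each training epoch. A small sketch using only the standard library follows; the three-seed layout mirrors the figure captions, but the scores and the `summarize_seeds` name are invented for illustration.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def summarize_seeds(runs: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Given one score curve per seed, return (mean, sample std) per epoch.

    All curves must have the same length, and at least two seeds are
    needed for the sample standard deviation to be defined.
    """
    if len(runs) < 2:
        raise ValueError("need at least two seeds to report a spread")
    if len({len(run) for run in runs}) != 1:
        raise ValueError("all seeds must cover the same number of epochs")
    return [(mean(epoch_scores), stdev(epoch_scores)) for epoch_scores in zip(*runs)]


# Invented OOD accuracies: 3 seeds x 4 epochs.
runs = [
    [0.50, 0.56, 0.54, 0.51],
    [0.48, 0.55, 0.53, 0.49],
    [0.52, 0.57, 0.52, 0.50],
]
for epoch, (m, s) in enumerate(summarize_seeds(runs), start=1):
    print(f"epoch {epoch}: {m:.3f} +/- {s:.3f}")
```

With a single seed, as the captions report for the protein setting, no spread can be computed, which the function makes explicit by raising an error.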

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work's significance and the scale of our controlled experiments. We address the major comment on task design and experimental details below, and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract] Abstract and implied experimental design: The central claims—that each stage reshapes generalization distinctly, SFT induces OOD peak-and-decline, and optimal trade-offs come from brief SFT + larger RL—rest entirely on the chosen ID/OOD tasks and metrics faithfully measuring biological reasoning and generalization. No details are provided on how ID/OOD splits were constructed, whether they share latent distributional features, or how metrics were validated to isolate multi-step inference rather than surface pattern matching. This is load-bearing, as confounds in task design or data distributions could artifactually produce the reported non-monotonicity and stage-composition effects.

Authors: We agree this is a substantive point and that the manuscript would benefit from expanded details on task construction to rule out potential confounds. In the revision, we will add a new subsection (Section 3.2) and appendix with: explicit descriptions of ID/OOD split construction for each domain (e.g., holding out specific species, sequence motifs, or functional categories to enforce distributional shift); quantitative checks confirming minimal overlap in latent features (via embedding similarity and motif analysis); and metric validation steps including ablation experiments and expert review to demonstrate that tasks require multi-step inference beyond surface patterns. These additions will directly support the non-monotonicity claims.

revision: yes
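The embedding-similarity check promised in the response could take many forms. One simple version, sketched here under assumptions, reports for each OOD example its maximum cosine similarity to any ID example, so that near-duplicates across the split stand out. The toy vectors, the 0.95 threshold, and the function names are hypothetical; the rebuttal does not specify the actual procedure.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def max_similarity_to_id(ood: List[Vector], in_domain: List[Vector]) -> List[float]:
    """For each OOD embedding, the highest cosine similarity to any ID embedding."""
    return [max(cosine(x, y) for y in in_domain) for x in ood]


def leaked_indices(ood: List[Vector], in_domain: List[Vector],
                   threshold: float = 0.95) -> List[int]:
    """Indices of OOD examples whose nearest ID neighbour exceeds the threshold."""
    sims = max_similarity_to_id(ood, in_domain)
    return [i for i, s in enumerate(sims) if s > threshold]


# Toy 2-D embeddings for illustration only.
id_embeddings = [(1.0, 0.0), (0.9, 0.1)]
ood_embeddings = [(0.0, 1.0), (1.0, 0.05)]
print(leaked_indices(ood_embeddings, id_embeddings))
```

A split with many flagged indices would suggest that the OOD set shares latent features with the training distribution, which is the confound the referee raises.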

Circularity Check

0 steps flagged

No circularity: purely empirical measurements of post-training effects

full rationale

The paper reports results from training and evaluating >100 models under controlled variations in CPT, SFT, and RL stages, directly measuring ID and OOD performance on genomics/transcriptomics/protein tasks. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing premises exist; the central claims about non-monotonicity and stage-composition effects follow immediately from the reported experimental outcomes without reduction to inputs by construction. The work is self-contained empirical science with no self-definitional, uniqueness-imported, or ansatz-smuggled steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical ablation study; its claims rest primarily on the validity of the chosen performance metrics and experimental controls rather than new mathematical axioms or postulated entities. No free parameters or invented entities are introduced in the reported findings.

axioms (1)

domain assumption: The in-domain and out-of-domain performance metrics validly measure biological reasoning ability and generalization.
The paper draws conclusions about how stages reshape reasoning from these metrics; if the metrics do not capture the intended capabilities, the stage-specific effects would not hold.

Reference graph

Works this paper leans on

99 extracted references · 28 canonical work pages · 5 internal anchors

[1]

Bioreason: Incentivizing multimodal biological reasoning within a dna-llm model

Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J Maddison, et al. Bioreason: Incentivizing multimodal biological reasoning within a dna-llm model. arXiv preprint arXiv:2505.23579, 2025

work page arXiv 2025
[2]

rbio1-training scientific reasoning llms with biological world models as soft verifiers.bioRxiv, pages 2025–08, 2025

Ana-Maria Istrate, Fausto Milletari, Fabrizio Castrotorres, Jakub M Tomczak, Michaela Torkar, Donghui Li, and Theofanis Karaletsos. rbio1-training scientific reasoning llms with biological world models as soft verifiers.bioRxiv, pages 2025–08, 2025

2025
[3]

Bioreason-pro: Advancing protein function prediction with multimodal biological reasoning.bioRxiv, pages 2026–03, 2026

Adibvafa Fallahpour, Arman Seyed-Ahmadi, Parsa Idehpour, Omar Ibrahim, Purav Gupta, Jack Naimer, Kevin Zhu, Arnav Shah, Shihao Ma, Abhinav Adduri, et al. Bioreason-pro: Advancing protein function prediction with multimodal biological reasoning.bioRxiv, pages 2026–03, 2026

2026
[4]

Evolm: In search of lost language model training dynamics

Zhenting Qi, Fan Nie, Alexandre Alahi, James Zou, Himabindu Lakkaraju, Yilun Du, Eric P Xing, Sham M Kakade, and Hanlin Zhang. Evolm: In search of lost language model training dynamics. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[5]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

2025
[6]

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, et al. Reinforcement learning for reasoning in large language models with one training example.arXiv:2504.20571, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? NeurIPS, 2025

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? NeurIPS, 2025

2025
[8]

Megascience: Pushing the frontiers of post-training datasets for science reasoning

Run-Ze Fan, Zengzhi Wang, and Pengfei Liu. Megascience: Pushing the frontiers of post-training datasets for science reasoning. arXiv:2507.16812, 2025

work page arXiv 2025
[9]

OpenThoughts: data recipes for reasoning models.ICLR, 2026

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: data recipes for reasoning models.ICLR, 2026

2026
[10]

Scaling large language models for next-generation single-cell analysis.BioRxiv, pages 2025–04, 2026

Syed Asad Rizvi, Daniel Levine, Aakash Patel, Shiyang Zhang, Eric Wang, Curtis Jamison Perry, Ivan Vrkic, Nicole Mayerli Constante, Zirui Fu, Sizhuang He, et al. Scaling large language models for next-generation single-cell analysis.BioRxiv, pages 2025–04, 2026

2025
[11]

Chang Yu, Siyuan Li, Zicheng Liu, Jingbo Zhou, Xianglong Guo, Kai Yu, Yuqing Zhou, Ken Li, Zelin Zang, Zhen Lei, and Stan Z. Li. CDBridge: A cross-omics post-training bridge strategy for context-aware biological modeling. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Hk4Fb6kaYF

2026
[12]

Unleashing scientific reasoning for bio-experimental protocol generation via structured component-based reward mechanism.ICLR, 2026

Haoran Sun, Yankai Jiang, Zhenyu Tang, Yaning Pan, Shuang Gu, Zekai Lin, Lilong Wang, Wenjie Lou, Lei Liu, Lei Bai, et al. Unleashing scientific reasoning for bio-experimental protocol generation via structured component-based reward mechanism.ICLR, 2026

2026
[13]

Sci-verifier: Scientific verifier with thinking.ICLR, 2026

Shenghe Zheng, Chenyu Huang, Fangchen Yu, Junchi Yao, Jingqi Ye, Tao Chen, Yun Luo, Ning Ding, Lei Bai, Ganqu Cui, et al. Sci-verifier: Scientific verifier with thinking.ICLR, 2026

2026
[14]

Cellduality: Unlocking biological reasoning in LLMs with self-supervised RLVR

Yuhang Chen, Zhen Tan, Ruichen Zhang, Mufan Qiu, and Tianlong Chen. Cellduality: Unlocking biological reasoning in LLMs with self-supervised RLVR. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=I4meJN28Ol

2026
[15]

VCWorld: a biological world model for virtual cell simulation.ICLR, 2026

Zhijian Wei, Runze Ma, Zichen Wang, Zhongmin Li, Shuotong Song, and Shuangjia Zheng. VCWorld: a biological world model for virtual cell simulation.ICLR, 2026

2026
[16]

Helix: Evolutionary reinforcement learning for open-ended scientific problem solving.ICLR, 2026

Chang Su, Zhongkai Hao, Zhizhou Zhang, Zeyu Xia, Youjia Wu, Hang Su, and Jun Zhu. Helix: Evolutionary reinforcement learning for open-ended scientific problem solving. ICLR, 2026

2026
[17]

Reshaping reasoning in llms: A theoretical analysis of rl training dynamics through pattern selection.ICLR, 2026

Xingwu Chen, Tianle Li, and Difan Zou. Reshaping reasoning in llms: A theoretical analysis of rl training dynamics through pattern selection.ICLR, 2026

2026
[18]

Training dynamics impact post-training quantization robustness. ICLR, 2026

Albert Catalan-Tatjer, Niccolò Ajroldi, and Jonas Geiping. Training dynamics impact post-training quantization robustness. ICLR, 2026

2026
[19]

The coverage principle: How pre-training enables post-training.ICLR, 2026

Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T Ash, Akshay Krishnamurthy, and Dylan J Foster. The coverage principle: How pre-training enables post-training.ICLR, 2026

2026
[20]

Benchmarking algorithms for generalizable single-cell perturbation response prediction.Nature Methods, 23(2):451–464, 2026

Zhiting Wei, Yiheng Wang, Yicheng Gao, Shuguang Wang, Ping Li, Duanmiao Si, Yuli Gao, Siqi Wu, Danlu Li, Kejing Dong, et al. Benchmarking algorithms for generalizable single-cell perturbation response prediction.Nature Methods, 23(2):451–464, 2026

2026
[21]

A fully automated benchmarking suite to compare macromolecular complexes.Nature Methods, 23(2):387–394, 2026

Gabriel Studer, Xavier Robin, Stefan Bienert, Janani Durairaj, Peter Škrinjar, Gerardo Tauriello, Andrew Mark Waterhouse, and Torsten Schwede. A fully automated benchmarking suite to compare macromolecular complexes.Nature Methods, 23(2):387–394, 2026

2026
[22]

PLINDER: the protein-ligand interactions dataset and evaluation resource.BioRxiv, pages 2024–07, 2024

Janani Durairaj, Yusuf Adeshina, Zhonglin Cao, Xuejin Zhang, Vladas Oleinikovas, Thomas Duignan, Zachary McClure, Xavier Robin, Gabriel Studer, Daniel Kovtun, et al. PLINDER: the protein-ligand interactions dataset and evaluation resource.BioRxiv, pages 2024–07, 2024

2024
[23]

ProCyon: a multimodal foundation model for protein phenotypes.BioRxiv, pages 2024–12, 2025

Owen Queen, Yepeng Huang, Robert Calef, Valentina Giunchiglia, Tianlong Chen, George Dasoulas, LeAnn Tai, Gianmarco Abbadessa, Owain Howell, Michelle M Li, et al. ProCyon: a multimodal foundation model for protein phenotypes.BioRxiv, pages 2024–12, 2025

2024
[24]

Evaluating generalizability of artificial intelligence models for molecular datasets

Yasha Ektefaie, Andrew Shen, Daria Bykova, Maximillian G Marin, Marinka Zitnik, and Maha Farhat. Evaluating generalizability of artificial intelligence models for molecular datasets. Nature Machine Intelligence, 6(12):1512–1524, 2024

2024
[25]

Zero-shot evaluation reveals limitations of single-cell foundation models.Genome Biology, 26(1):101, 2025

Kasia Z Kedzierska, Lorin Crawford, Ava P Amini, and Alex X Lu. Zero-shot evaluation reveals limitations of single-cell foundation models.Genome Biology, 26(1):101, 2025

2025
[26]

Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines.Nature Methods, 22(8):1657–1661, 2025

Constantin Ahlmann-Eltze, Wolfgang Huber, and Simon Anders. Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines.Nature Methods, 22(8):1657–1661, 2025

2025
[27]

LoongRL: reinforcement learning for advanced reasoning over long contexts.ICLR, 2026

Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, and Mao Yang. LoongRL: reinforcement learning for advanced reasoning over long contexts.ICLR, 2026

2026
[28]

The art of scaling reinforcement learning compute for llms.ICLR, 2026

Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms.ICLR, 2026

2026
[29]

Rethinking LLM reasoning: From explicit trajectories to latent representations

Cong Jiang, Xiaofeng Zhang, Fangzhi Zhu, XiaoWei Chen, Junxiong Zhu, and Zheng Zhang. Rethinking LLM reasoning: From explicit trajectories to latent representations. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=CbK7lYbmv8

2026
[30]

CoT-Evo: evolutionary distillation of chain-of-thought for scientific reasoning.ICLR, 2026

Kehua Feng, Keyan Ding, Zhihui Zhu, Lei Liang, Qiang Zhang, and Huajun Chen. CoT-Evo: evolutionary distillation of chain-of-thought for scientific reasoning.ICLR, 2026

2026
[31]

scPilot: Large language model reasoning toward automated single-cell analysis and discovery.NeurIPS, 2025

Yiming Gao, Zhen Wang, Jefferson Chen, Mark Antkowiak, Mengzhou Hu, JungHo Kong, Dexter Pratt, Jieyuan Liu, Enze Ma, Zhiting Hu, et al. scPilot: Large language model reasoning toward automated single-cell analysis and discovery.NeurIPS, 2025

2025
[32]

AI-researcher: autonomous scientific innovation.NeurIPS, 2025

Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. AI-researcher: autonomous scientific innovation.NeurIPS, 2025

2025
[33]

Training a scientific reasoning model for chemistry.NeurIPS, 2025

Siddharth M Narayanan, James D Braza, Ryan-Rhys Griffiths, Albert Bou, Geemi Wellawatte, Mayk Caldas Ramos, Ludovico Mitchener, Samuel G Rodriques, and Andrew D White. Training a scientific reasoning model for chemistry. NeurIPS, 2025

2025
[34]

Language models for biological research: a primer.Nature Methods, 21(8):1422–1429, 2024

Elana Simon, Kyle Swanson, and James Zou. Language models for biological research: a primer.Nature Methods, 21(8):1422–1429, 2024

2024
[35]

Dnabert: pre-trained bidirectional encoder representations from transformers model for dna-language in genome.Bioinformatics, 37(15):2112–2120, 2021

Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V Davuluri. Dnabert: pre-trained bidirectional encoder representations from transformers model for dna-language in genome.Bioinformatics, 37(15):2112–2120, 2021

2021
[36]

Dnabert-2: Efficient foundation model and benchmark for multi-species genome

Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, and Han Liu. Dnabert-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006, 2023

work page arXiv 2023
[37]

Sequence modeling and design from molecular to genome scale with evo.Science, 386(6723):eado9336, 2024

Eric Nguyen, Michael Poli, Matthew G Durrant, Brian Kang, Dhruva Katrekar, David B Li, Liam J Bartie, Armin W Thomas, Samuel H King, Garyk Brixi, et al. Sequence modeling and design from molecular to genome scale with evo.Science, 386(6723):eado9336, 2024

2024
[38]

Genome modelling and design across all domains of life with evo 2.Nature, pages 1–13, 2026

Garyk Brixi, Matthew G Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwanggyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A Gonzalez, et al. Genome modelling and design across all domains of life with evo 2.Nature, pages 1–13, 2026

2026
[39]

Nucleotide transformer: building and evaluating robust foundation models for human genomics.Nature Methods, 22(2):287–297, 2025

Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P De Almeida, Hassan Sirelkhatim, et al. Nucleotide transformer: building and evaluating robust foundation models for human genomics.Nature Methods, 22(2):287–297, 2025

2025
[40]

Alphagenome: advancing regulatory variant effect prediction with a unified dna sequence model.BioRxiv, pages 2025–06, 2025

Žiga Avsec, Natasha Latysheva, Jun Cheng, Guido Novati, Kyle R Taylor, Tom Ward, Clare Bycroft, Lauren Nicolaisen, Eirini Arvaniti, Joshua Pan, et al. Alphagenome: advancing regulatory variant effect prediction with a unified dna sequence model.BioRxiv, pages 2025–06, 2025

2025
[41]

The omg dataset: An open metagenomic corpus for mixed-modality genomic language modeling.bioRxiv, pages 2024–08, 2024

Andre Cornman, Jacob West-Roberts, Antonio Pedro Camargo, Simon Roux, Martin Beracochea, Milot Mirdita, Sergey Ovchinnikov, and Yunha Hwang. The omg dataset: An open metagenomic corpus for mixed-modality genomic language modeling. bioRxiv, pages 2024–08, 2024

2024
[42]

PhageBench: Can LLMs Understand Raw Bacteriophage Genomes?

Yusen Hou, Weicai Long, Haitao Hu, Houcheng Su, Junning Feng, and Yanlin Zhang. Phagebench: Can llms understand raw bacteriophage genomes? arXiv preprint arXiv:2604.05775, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

Orthrus: toward evolutionary and functional rna foundation models.Nature Methods, pages 1–11, 2026

Philip Fradkin, Ruian “Ian” Shi, Taykhoom Dalal, Keren Isaev, Brendan J Frey, Leo J Lee, Quaid Morris, and Bo Wang. Orthrus: toward evolutionary and functional rna foundation models.Nature Methods, pages 1–11, 2026

2026
[44]

Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions. arXiv preprint arXiv:2204.00300, 2022

Jiayang Chen, Zhihang Hu, Siqi Sun, Qingxiong Tan, Yixuan Wang, Qinze Yu, Licheng Zong, Liang Hong, Jin Xiao, Tao Shen, et al. Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions. arXiv preprint arXiv:2204.00300, 2022

2022
[45]

A cross-species generative cell atlas across 1.5 billion years of evolution: The transcriptformer single-cell model.bioRxiv, pages 2025–04, 2025

James D Pearce, Sara E Simmonds, Gita Mahmoudabadi, Lakshmi Krishnan, Giovanni Palla, Ana-Maria Istrate, Alexander Tarashansky, Benjamin Nelson, Omar Valenzuela, Donghui Li, et al. A cross-species generative cell atlas across 1.5 billion years of evolution: The transcriptformer single-cell model.bioRxiv, pages 2025–04, 2025

2025
[46]

scgpt: toward building a foundation model for single-cell multi-omics using generative ai

Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang. scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nature methods, 21(8):1470–1480, 2024

2024
[47]

Transfer learning enables predictions in network biology.Nature, 618(7965):616–624, 2023

Christina V Theodoris, Ling Xiao, Anant Chopra, Mark D Chaffin, Zeina R Al Sayed, Matthew C Hill, Helene Mantineo, Elizabeth M Brydon, Zexian Zeng, X Shirley Liu, et al. Transfer learning enables predictions in network biology.Nature, 618(7965):616–624, 2023

2023
[48]

Predicting cellular responses to perturbation across diverse contexts with state.BioRxiv, pages 2025–06, 2025

Abhinav K Adduri, Dhruv Gautam, Beatrice Bevilacqua, Alishba Imran, Rohan Shah, Mohsen Naghipourfar, Noam Teyssier, Rajesh Ilango, Sanjay Nagaraj, Mingze Dong, et al. Predicting cellular responses to perturbation across diverse contexts with state. BioRxiv, pages 2025–06, 2025

2025
[49]

Large-scale foundation model on single-cell transcriptomics.Nature methods, 21(8):1481–1491, 2024

Minsheng Hao, Jing Gong, Xin Zeng, Chiming Liu, Yucheng Guo, Xingyi Cheng, Taifeng Wang, Jianzhu Ma, Xuegong Zhang, and Le Song. Large-scale foundation model on single-cell transcriptomics.Nature methods, 21(8):1481–1491, 2024

2024
[50]

scgenept: Is language all you need for modeling single-cell perturbations?bioRxiv, pages 2024–10, 2024

Ana-Maria Istrate, Donghui Li, and Theofanis Karaletsos. scgenept: Is language all you need for modeling single-cell perturbations?bioRxiv, pages 2024–10, 2024

2024
[51]

Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

2023
[52]

Simulating 500 million years of evolution with a language model.Science, 387(6736):850–858, 2025

Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model.Science, 387(6736):850–858, 2025

2025
[53]

Progen2: exploring the boundaries of protein language models.Cell systems, 14(11):968–978, 2023

Erik Nijkamp, Jeffrey A Ruffolo, Eli N Weinstein, Nikhil Naik, and Ali Madani. Progen2: exploring the boundaries of protein language models.Cell systems, 14(11):968–978, 2023

2023
[54]

Unified rational protein engineering with sequence-based deep representation learning

Ethan C Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M Church. Unified rational protein engineering with sequence-based deep representation learning. Nature methods, 16(12):1315–1322, 2019

2019
[55]

Multimodal learning enables chat-based exploration of single-cell data.Nature Biotechnology, pages 1–11, 2025

Moritz Schaefer, Peter Peneder, Daniel Malzl, Salvo Danilo Lombardo, Mihaela Peycheva, Jake Burton, Anna Hakobyan, Varun Sharma, Thomas Krausgruber, Celine Sin, et al. Multimodal learning enables chat-based exploration of single-cell data.Nature Biotechnology, pages 1–11, 2025

2025
[56]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022
[57]

D-cpt law: Domain-specific continual pre-training scaling law for large language models.Advances in Neural Information Processing Systems, 37:90318–90354, 2024

Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, et al. D-cpt law: Domain-specific continual pre-training scaling law for large language models.Advances in Neural Information Processing Systems, 37:90318–90354, 2024

2024
[58]

Understanding the effects of RLHF on LLM generalisation and diversity

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the effects of RLHF on LLM generalisation and diversity. In International Conference on Learning Representations, 2024

2024
[59]

Don’t stop pretraining: Adapt language models to domains and tasks

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 8342–8360, 2020

2020
[60]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

2020
[61]

Training compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Advances in Neural Information Processing Systems, 2022

2022
[62]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, 2023

2023
[63]

When scaling meets LLM finetuning: The effect of data, model and finetuning method

Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. When scaling meets LLM finetuning: The effect of data, model and finetuning method. In International Conference on Learning Representations, 2024

2024
[64]

Continual pre-training of large language models: How to (re)warm your model?

Kshitij Gupta, Dan Iter, and Daniel Hershcovich. Continual pre-training of large language models: How to (re)warm your model? arXiv preprint arXiv:2308.04014, 2023

2023
[65]

Simple and scalable strategies to continually pre-train large language models

Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, and Irina Rish. Simple and scalable strategies to continually pre-train large language models. arXiv preprint arXiv:2403.08763, 2024

2024
[66]

Reuse, don’t retrain: A recipe for continued pretraining of language models

Jupinder Parmar, Sanjev Prabhu, Suchin Gururangan, Hailey Awadalla, Shaden Smith, and Niklas Muennighoff. Reuse, don’t retrain: A recipe for continued pretraining of language models. arXiv preprint arXiv:2407.07263, 2024

2024
[67]

Continual pre-training of language models

Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models. In International Conference on Learning Representations, 2023

2023
[68]

Adapting large language models via reading comprehension.arXiv preprint arXiv:2309.09530, 2024

Daixuan Cheng, Shaohan Huang, and Furu Wei. Adapting large language models via reading comprehension.arXiv preprint arXiv:2309.09530, 2024

2024
[69]

Composer 2 technical report, 2026

Cursor Research et al. Composer 2 technical report, 2026. URL https://arxiv.org/abs/2603.24477

2026
[70]

Sft memorizes, rl generalizes: A comparative study of foundation model post-training

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. In International Conference on Machine Learning, pages 10818–10838. PMLR, 2025

2025
[71]

Gene-r1: Reasoning with data-augmented lightweight llms for gene set analysis

Zhizheng Wang, Yifan Yang, Qiao Jin, and Zhiyong Lu. Gene-r1: Reasoning with data-augmented lightweight llms for gene set analysis. In Biocomputing 2026: Proceedings of the Pacific Symposium, pages 494–507. World Scientific, 2025

2026
[72]

Toward scientific reasoning in llms: Training from expert discussions via reinforcement learning. arXiv preprint arXiv:2505.19501, 2025

Ming Yin, Yuanhao Qu, Ling Yang, Le Cong, and Mengdi Wang. Toward scientific reasoning in llms: Training from expert discussions via reinforcement learning. arXiv preprint arXiv:2505.19501, 2025

2025
[73]

Medea: An omics ai agent for therapeutic discovery.bioRxiv, pages 2026–01, 2026

Pengwei Sui, Michelle M Li, Shanghua Gao, Wanxiang Shen, Valentina Giunchiglia, Andrew Shen, Yepeng Huang, Zhenglun Kong, and Marinka Zitnik. Medea: An omics ai agent for therapeutic discovery.bioRxiv, pages 2026–01, 2026

2026
[74]

Interpro in 2022.Nucleic acids research, 51(D1):D418–D427, 2023

Typhaine Paysan-Lafosse, Matthias Blum, Sara Chuguransky, Tiago Grego, Beatriz Lázaro Pinto, Gustavo A Salazar, Maxwell L Bileschi, Peer Bork, Alan Bridge, Lucy Colwell, et al. Interpro in 2022.Nucleic acids research, 51(D1):D418–D427, 2023

2022
[75]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

2025
[76]

Google DeepMind. Gemma 4. https://deepmind.google/models/gemma/gemma-4/, 2026. Accessed 2026-05-04

2026
[77]

FineFineWeb: A comprehensive study on fine-grained domain web corpus

M-A-P, Ge Zhang, Xinrun Du, Zhimiao Yu, Zili Wang, Zekun Wang, Shuyue Guo, Tianyu Zheng, Kang Zhu, Jerry Liu, Shawn Yue, Binbin Liu, Zhongyuan Peng, Yifan Yao, Jack Yang, Ziming Li, Bingni Zhang, Minghao Liu, Tianyu Liu, Yang Gao, Wenhu Chen, Xiaohuan Zhou, Qian Liu, Taifeng Wang, and Wenhao Huang. FineFineWeb: A comprehensive study on fine-grained domai...
[78]

Version v0.1.0; Hugging Face dataset
[79]

KEGG: Kyoto encyclopedia of genes and genomes

Minoru Kanehisa and Susumu Goto. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1):27–30, 2000. doi: 10.1093/nar/28.1.27

2000
[80]

New approach for understanding genome variations in KEGG.Nucleic Acids Research, 47(D1): D590–D595, 2019

Minoru Kanehisa, Yoko Sato, Miho Furumichi, Kanae Morishima, and Mao Tanabe. New approach for understanding genome variations in KEGG. Nucleic Acids Research, 47(D1): D590–D595, 2019. doi: 10.1093/nar/gky962

2019


[1] [1]

Bioreason: Incentivizing multimodal biological reasoning within a dna-llm model

Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J Maddison, et al. Bioreason: Incentivizing multimodal biological reasoning within a dna-llm model. arXiv preprint arXiv:2505.23579, 2025

2025

[2] [2]

rbio1-training scientific reasoning llms with biological world models as soft verifiers.bioRxiv, pages 2025–08, 2025

Ana-Maria Istrate, Fausto Milletari, Fabrizio Castrotorres, Jakub M Tomczak, Michaela Torkar, Donghui Li, and Theofanis Karaletsos. rbio1-training scientific reasoning llms with biological world models as soft verifiers.bioRxiv, pages 2025–08, 2025

2025

[3] [3]

Bioreason-pro: Advancing protein function prediction with multimodal biological reasoning.bioRxiv, pages 2026–03, 2026

Adibvafa Fallahpour, Arman Seyed-Ahmadi, Parsa Idehpour, Omar Ibrahim, Purav Gupta, Jack Naimer, Kevin Zhu, Arnav Shah, Shihao Ma, Abhinav Adduri, et al. Bioreason-pro: Advancing protein function prediction with multimodal biological reasoning.bioRxiv, pages 2026–03, 2026

2026

[4] [4]

Evolm: In search of lost language model training dynamics

Zhenting Qi, Fan Nie, Alexandre Alahi, James Zou, Himabindu Lakkaraju, Yilun Du, Eric P Xing, Sham M Kakade, and Hanlin Zhang. Evolm: In search of lost language model training dynamics. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[5] [5]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

2025

[6] [6]

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, et al. Reinforcement learning for reasoning in large language models with one training example.arXiv:2504.20571, 2025

2025

[7] [7]

Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? NeurIPS, 2025

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? NeurIPS, 2025

2025

[8] [8]

Megascience: Pushing the frontiers of post-training datasets for science reasoning

Run-Ze Fan, Zengzhi Wang, and Pengfei Liu. Megascience: Pushing the frontiers of post-training datasets for science reasoning. arXiv:2507.16812, 2025

2025

[9] [9]

OpenThoughts: data recipes for reasoning models.ICLR, 2026

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: data recipes for reasoning models.ICLR, 2026

2026

[10] [10]

Scaling large language models for next-generation single-cell analysis.BioRxiv, pages 2025–04, 2026

Syed Asad Rizvi, Daniel Levine, Aakash Patel, Shiyang Zhang, Eric Wang, Curtis Jamison Perry, Ivan Vrkic, Nicole Mayerli Constante, Zirui Fu, Sizhuang He, et al. Scaling large language models for next-generation single-cell analysis.BioRxiv, pages 2025–04, 2026

2025

[11] [11]

Chang Yu, Siyuan Li, Zicheng Liu, Jingbo Zhou, Xianglong Guo, Kai Yu, Yuqing Zhou, Ken Li, Zelin Zang, Zhen Lei, and Stan Z. Li. CDBridge: A cross-omics post-training bridge strategy for context-aware biological modeling. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Hk4Fb6kaYF

2026

[12] [12]

Unleashing scientific reasoning for bio-experimental protocol generation via structured component-based reward mechanism.ICLR, 2026

Haoran Sun, Yankai Jiang, Zhenyu Tang, Yaning Pan, Shuang Gu, Zekai Lin, Lilong Wang, Wenjie Lou, Lei Liu, Lei Bai, et al. Unleashing scientific reasoning for bio-experimental protocol generation via structured component-based reward mechanism.ICLR, 2026

2026

[13] [13]

Sci-verifier: Scientific verifier with thinking.ICLR, 2026

Shenghe Zheng, Chenyu Huang, Fangchen Yu, Junchi Yao, Jingqi Ye, Tao Chen, Yun Luo, Ning Ding, Lei Bai, Ganqu Cui, et al. Sci-verifier: Scientific verifier with thinking.ICLR, 2026

2026

[14] [14]

Cellduality: Unlocking biological reasoning in LLMs with self-supervised RLVR

Yuhang Chen, Zhen Tan, Ruichen Zhang, Mufan Qiu, and Tianlong Chen. Cellduality: Unlocking biological reasoning in LLMs with self-supervised RLVR. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=I4meJN28Ol

2026

[15] [15]

VCWorld: a biological world model for virtual cell simulation.ICLR, 2026

Zhijian Wei, Runze Ma, Zichen Wang, Zhongmin Li, Shuotong Song, and Shuangjia Zheng. VCWorld: a biological world model for virtual cell simulation.ICLR, 2026

2026

[16] [16]

Helix: Evolutionary reinforcement learning for open-ended scientific problem solving.ICLR, 2026

Chang Su, Zhongkai Hao, Zhizhou Zhang, Zeyu Xia, Youjia Wu, Hang Su, and Jun Zhu. Helix: Evolutionary reinforcement learning for open-ended scientific problem solving. ICLR, 2026

2026

[17] [17]

Reshaping reasoning in llms: A theoretical analysis of rl training dynamics through pattern selection.ICLR, 2026

Xingwu Chen, Tianle Li, and Difan Zou. Reshaping reasoning in llms: A theoretical analysis of rl training dynamics through pattern selection.ICLR, 2026

2026

[18] [18]

Training dynamics impact post-training quantization robustness. ICLR, 2026

Albert Catalan-Tatjer, Niccolò Ajroldi, and Jonas Geiping. Training dynamics impact post-training quantization robustness. ICLR, 2026

2026

[19] [19]

The coverage principle: How pre-training enables post-training.ICLR, 2026

Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T Ash, Akshay Krishnamurthy, and Dylan J Foster. The coverage principle: How pre-training enables post-training.ICLR, 2026

2026

[20] [20]

Benchmarking algorithms for generalizable single-cell perturbation response prediction.Nature Methods, 23(2):451–464, 2026

Zhiting Wei, Yiheng Wang, Yicheng Gao, Shuguang Wang, Ping Li, Duanmiao Si, Yuli Gao, Siqi Wu, Danlu Li, Kejing Dong, et al. Benchmarking algorithms for generalizable single-cell perturbation response prediction.Nature Methods, 23(2):451–464, 2026

2026

[21] [21]

A fully automated benchmarking suite to compare macromolecular complexes.Nature Methods, 23(2):387–394, 2026

Gabriel Studer, Xavier Robin, Stefan Bienert, Janani Durairaj, Peter Škrinjar, Gerardo Tauriello, Andrew Mark Waterhouse, and Torsten Schwede. A fully automated benchmarking suite to compare macromolecular complexes.Nature Methods, 23(2):387–394, 2026

2026

[22] [22]

PLINDER: the protein-ligand interactions dataset and evaluation resource.BioRxiv, pages 2024–07, 2024

Janani Durairaj, Yusuf Adeshina, Zhonglin Cao, Xuejin Zhang, Vladas Oleinikovas, Thomas Duignan, Zachary McClure, Xavier Robin, Gabriel Studer, Daniel Kovtun, et al. PLINDER: the protein-ligand interactions dataset and evaluation resource.BioRxiv, pages 2024–07, 2024

2024

[23] [23]

ProCyon: a multimodal foundation model for protein phenotypes.BioRxiv, pages 2024–12, 2025

Owen Queen, Yepeng Huang, Robert Calef, Valentina Giunchiglia, Tianlong Chen, George Dasoulas, LeAnn Tai, Gianmarco Abbadessa, Owain Howell, Michelle M Li, et al. ProCyon: a multimodal foundation model for protein phenotypes.BioRxiv, pages 2024–12, 2025

2024

[24] [24]

Evaluating generalizability of artificial intelligence models for molecular datasets

Yasha Ektefaie, Andrew Shen, Daria Bykova, Maximillian G Marin, Marinka Zitnik, and Maha Farhat. Evaluating generalizability of artificial intelligence models for molecular datasets. Nature Machine Intelligence, 6(12):1512–1524, 2024

2024

[25] [25]

Zero-shot evaluation reveals limitations of single-cell foundation models.Genome Biology, 26(1):101, 2025

Kasia Z Kedzierska, Lorin Crawford, Ava P Amini, and Alex X Lu. Zero-shot evaluation reveals limitations of single-cell foundation models.Genome Biology, 26(1):101, 2025

2025

[26] [26]

Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines.Nature Methods, 22(8):1657–1661, 2025

Constantin Ahlmann-Eltze, Wolfgang Huber, and Simon Anders. Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines.Nature Methods, 22(8):1657–1661, 2025

2025

[27] [27]

LoongRL: reinforcement learning for advanced reasoning over long contexts.ICLR, 2026

Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, and Mao Yang. LoongRL: reinforcement learning for advanced reasoning over long contexts.ICLR, 2026

2026

[28] [28]

The art of scaling reinforcement learning compute for llms.ICLR, 2026

Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms.ICLR, 2026

2026

[29] [29]

Rethinking LLM reasoning: From explicit trajectories to latent representations

Cong Jiang, Xiaofeng Zhang, Fangzhi Zhu, XiaoWei Chen, Junxiong Zhu, and Zheng Zhang. Rethinking LLM reasoning: From explicit trajectories to latent representations. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=CbK7lYbmv8

2026

[30] [30]

CoT-Evo: evolutionary distillation of chain-of-thought for scientific reasoning.ICLR, 2026

Kehua Feng, Keyan Ding, Zhihui Zhu, Lei Liang, Qiang Zhang, and Huajun Chen. CoT-Evo: evolutionary distillation of chain-of-thought for scientific reasoning.ICLR, 2026

2026

[31] [31]

scPilot: Large language model reasoning toward automated single-cell analysis and discovery.NeurIPS, 2025

Yiming Gao, Zhen Wang, Jefferson Chen, Mark Antkowiak, Mengzhou Hu, JungHo Kong, Dexter Pratt, Jieyuan Liu, Enze Ma, Zhiting Hu, et al. scPilot: Large language model reasoning toward automated single-cell analysis and discovery.NeurIPS, 2025

2025

[32] [32]

AI-researcher: autonomous scientific innovation.NeurIPS, 2025

Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. AI-researcher: autonomous scientific innovation.NeurIPS, 2025

2025

[33] [33]

Training a scientific reasoning model for chemistry.NeurIPS, 2025

Siddharth M Narayanan, James D Braza, Ryan-Rhys Griffiths, Albert Bou, Geemi Wellawatte, Mayk Caldas Ramos, Ludovico Mitchener, Samuel G Rodriques, and Andrew D White. Training a scientific reasoning model for chemistry. NeurIPS, 2025

2025

[34] [34]

Language models for biological research: a primer.Nature Methods, 21(8):1422–1429, 2024

Elana Simon, Kyle Swanson, and James Zou. Language models for biological research: a primer.Nature Methods, 21(8):1422–1429, 2024

2024

[35] [35]

Dnabert: pre-trained bidirectional encoder representations from transformers model for dna-language in genome.Bioinformatics, 37(15):2112–2120, 2021

Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V Davuluri. Dnabert: pre-trained bidirectional encoder representations from transformers model for dna-language in genome.Bioinformatics, 37(15):2112–2120, 2021

2021

[36] [36]

Dnabert-2: Efficient foundation model and benchmark for multi-species genome

Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, and Han Liu. Dnabert-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006, 2023

2023

[37] [37]

Sequence modeling and design from molecular to genome scale with evo.Science, 386(6723):eado9336, 2024

Eric Nguyen, Michael Poli, Matthew G Durrant, Brian Kang, Dhruva Katrekar, David B Li, Liam J Bartie, Armin W Thomas, Samuel H King, Garyk Brixi, et al. Sequence modeling and design from molecular to genome scale with evo.Science, 386(6723):eado9336, 2024

2024

[38] [38]

Genome modelling and design across all domains of life with evo 2.Nature, pages 1–13, 2026

Garyk Brixi, Matthew G Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwanggyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A Gonzalez, et al. Genome modelling and design across all domains of life with evo 2.Nature, pages 1–13, 2026

2026

[39] [39]

Nucleotide transformer: building and evaluating robust foundation models for human genomics.Nature Methods, 22(2):287–297, 2025

Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P De Almeida, Hassan Sirelkhatim, et al. Nucleotide transformer: building and evaluating robust foundation models for human genomics.Nature Methods, 22(2):287–297, 2025

2025

[40] [40]

Alphagenome: advancing regulatory variant effect prediction with a unified dna sequence model.BioRxiv, pages 2025–06, 2025

Žiga Avsec, Natasha Latysheva, Jun Cheng, Guido Novati, Kyle R Taylor, Tom Ward, Clare Bycroft, Lauren Nicolaisen, Eirini Arvaniti, Joshua Pan, et al. Alphagenome: advancing regulatory variant effect prediction with a unified dna sequence model.BioRxiv, pages 2025–06, 2025

2025

[41] [41]

The omg dataset: An open metagenomic corpus for mixed-modality genomic language modeling.bioRxiv, pages 2024–08, 2024

Andre Cornman, Jacob West-Roberts, Antonio Pedro Camargo, Simon Roux, Martin Bera- cochea, Milot Mirdita, Sergey Ovchinnikov, and Yunha Hwang. The omg dataset: An open metagenomic corpus for mixed-modality genomic language modeling.bioRxiv, pages 2024–08, 2024

2024

[42] [42]

PhageBench: Can LLMs Understand Raw Bacteriophage Genomes?

Yusen Hou, Weicai Long, Haitao Hu, Houcheng Su, Junning Feng, and Yanlin Zhang. Phagebench: Can llms understand raw bacteriophage genomes?arXiv preprint arXiv:2604.05775, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[43] [43]

Orthrus: toward evolutionary and functional rna foundation models.Nature Methods, pages 1–11, 2026

Philip Fradkin, Ruian “Ian” Shi, Taykhoom Dalal, Keren Isaev, Brendan J Frey, Leo J Lee, Quaid Morris, and Bo Wang. Orthrus: toward evolutionary and functional rna foundation models.Nature Methods, pages 1–11, 2026

2026

[44] [44]

Interpretable rna foundation model from unannotated data for highly accu- rate rna structure and function predictions.arXiv preprint arXiv:2204.00300,

Jiayang Chen, Zhihang Hu, Siqi Sun, Qingxiong Tan, Yixuan Wang, Qinze Yu, Licheng Zong, Liang Hong, Jin Xiao, Tao Shen, et al. Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions.arXiv preprint arXiv:2204.00300, 2022

work page arXiv 2022

[45] [45]

A cross-species generative cell atlas across 1.5 billion years of evolution: The transcriptformer single-cell model.bioRxiv, pages 2025–04, 2025

James D Pearce, Sara E Simmonds, Gita Mahmoudabadi, Lakshmi Krishnan, Giovanni Palla, Ana-Maria Istrate, Alexander Tarashansky, Benjamin Nelson, Omar Valenzuela, Donghui Li, et al. A cross-species generative cell atlas across 1.5 billion years of evolution: The transcriptformer single-cell model.bioRxiv, pages 2025–04, 2025

2025

[46] [46]

scgpt: toward building a foundation model for single-cell multi-omics using generative ai

Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang. scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nature methods, 21(8):1470–1480, 2024

2024

[47] [47]

Transfer learning enables predictions in network biology.Nature, 618(7965):616–624, 2023

Christina V Theodoris, Ling Xiao, Anant Chopra, Mark D Chaffin, Zeina R Al Sayed, Matthew C Hill, Helene Mantineo, Elizabeth M Brydon, Zexian Zeng, X Shirley Liu, et al. Transfer learning enables predictions in network biology.Nature, 618(7965):616–624, 2023

2023

[48] [48]

Predicting cellular responses to perturbation across diverse contexts with state.BioRxiv, pages 2025–06, 2025

Abhinav K Adduri, Dhruv Gautam, Beatrice Bevilacqua, Alishba Imran, Rohan Shah, Mohsen Naghipourfar, Noam Teyssier, Rajesh Ilango, Sanjay Nagaraj, Mingze Dong, et al. Predicting cellular responses to perturbation across diverse contexts with state.BioRxiv, pages 2025–06, 2025. 13

2025

[49] [49]

Large-scale foundation model on single-cell transcriptomics.Nature methods, 21(8):1481–1491, 2024

Minsheng Hao, Jing Gong, Xin Zeng, Chiming Liu, Yucheng Guo, Xingyi Cheng, Taifeng Wang, Jianzhu Ma, Xuegong Zhang, and Le Song. Large-scale foundation model on single-cell transcriptomics.Nature methods, 21(8):1481–1491, 2024

2024

[50] [50]

scgenept: Is language all you need for modeling single-cell perturbations?bioRxiv, pages 2024–10, 2024

Ana-Maria Istrate, Donghui Li, and Theofanis Karaletsos. scgenept: Is language all you need for modeling single-cell perturbations?bioRxiv, pages 2024–10, 2024

2024

[51] [51]

Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

2023

[52] [52]

Simulating 500 million years of evolution with a language model.Science, 387(6736):850–858, 2025

Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model.Science, 387(6736):850–858, 2025

2025

[53] [53]

Progen2: exploring the boundaries of protein language models.Cell systems, 14(11):968–978, 2023

Erik Nijkamp, Jeffrey A Ruffolo, Eli N Weinstein, Nikhil Naik, and Ali Madani. Progen2: exploring the boundaries of protein language models.Cell systems, 14(11):968–978, 2023

2023

[54] [54]

Unified rational protein engineering with sequence-based deep representation learning

Ethan C Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M Church. Unified rational protein engineering with sequence-based deep representation learning. Nature methods, 16(12):1315–1322, 2019

2019

[55] [55]

Multimodal learning enables chat-based exploration of single-cell data.Nature Biotechnology, pages 1–11, 2025

Moritz Schaefer, Peter Peneder, Daniel Malzl, Salvo Danilo Lombardo, Mihaela Peycheva, Jake Burton, Anna Hakobyan, Varun Sharma, Thomas Krausgruber, Celine Sin, et al. Multimodal learning enables chat-based exploration of single-cell data.Nature Biotechnology, pages 1–11, 2025

2025

[56] [56]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022

[57]

Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, et al. D-CPT law: Domain-specific continual pre-training scaling law for large language models. Advances in Neural Information Processing Systems, 37:90318–90354, 2024

[58]

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the effects of RLHF on LLM generalisation and diversity. In International Conference on Learning Representations, 2024

[59]

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, 2020

[60]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

[61]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Advances in Neural Information Processing Systems, 2022

[62]

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, 2023

[63]

Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. When scaling meets LLM finetuning: The effect of data, model and finetuning method. In International Conference on Learning Representations, 2024

[64]

Kshitij Gupta, Dan Iter, and Daniel Hershcovich. Continual pre-training of large language models: How to (re)warm your model? arXiv preprint arXiv:2308.04014, 2023

[65]

Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, and Irina Rish. Simple and scalable strategies to continually pre-train large language models. arXiv preprint arXiv:2403.08763, 2024

[66]

Jupinder Parmar, Sanjev Prabhu, Suchin Gururangan, Hailey Awadalla, Shaden Smith, and Niklas Muennighoff. Reuse, don’t retrain: A recipe for continued pretraining of language models. arXiv preprint arXiv:2407.07263, 2024

[67]

Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models. In International Conference on Learning Representations, 2023

[68]

Daixuan Cheng, Shaohan Huang, and Furu Wei. Adapting large language models via reading comprehension. arXiv preprint arXiv:2309.09530, 2024

[69]

Cursor Research et al. Composer 2 technical report, 2026. URL https://arxiv.org/abs/2603.24477

[70]

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. In International Conference on Machine Learning, pages 10818–10838. PMLR, 2025

[71]

Zhizheng Wang, Yifan Yang, Qiao Jin, and Zhiyong Lu. Gene-R1: Reasoning with data-augmented lightweight LLMs for gene set analysis. In Biocomputing 2026: Proceedings of the Pacific Symposium, pages 494–507. World Scientific, 2025

[72]

Ming Yin, Yuanhao Qu, Ling Yang, Le Cong, and Mengdi Wang. Toward scientific reasoning in LLMs: Training from expert discussions via reinforcement learning. arXiv preprint arXiv:2505.19501, 2025

[73]

Pengwei Sui, Michelle M Li, Shanghua Gao, Wanxiang Shen, Valentina Giunchiglia, Andrew Shen, Yepeng Huang, Zhenglun Kong, and Marinka Zitnik. Medea: An omics AI agent for therapeutic discovery. bioRxiv, pages 2026–01, 2026

[74]

Typhaine Paysan-Lafosse, Matthias Blum, Sara Chuguransky, Tiago Grego, Beatriz Lázaro Pinto, Gustavo A Salazar, Maxwell L Bileschi, Peer Bork, Alan Bridge, Lucy Colwell, et al. InterPro in 2022. Nucleic Acids Research, 51(D1):D418–D427, 2023

[75]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[76]

Google DeepMind. Gemma 4. https://deepmind.google/models/gemma/gemma-4/, 2026. Accessed 2026-05-04

[77]

FineFineWeb: A comprehensive study on fine-grained domain web corpus

M-A-P, Ge Zhang, Xinrun Du, Zhimiao Yu, Zili Wang, Zekun Wang, Shuyue Guo, Tianyu Zheng, Kang Zhu, Jerry Liu, Shawn Yue, Binbin Liu, Zhongyuan Peng, Yifan Yao, Jack Yang, Ziming Li, Bingni Zhang, Minghao Liu, Tianyu Liu, Yang Gao, Wenhu Chen, Xiaohuan Zhou, Qian Liu, Taifeng Wang, and Wenhao Huang. FineFineWeb: A comprehensive study on fine-grained domai...

[78]

Version v0.1.0; Hugging Face dataset

[79]

Minoru Kanehisa and Susumu Goto. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1):27–30, 2000. doi: 10.1093/nar/28.1.27

[80]

Minoru Kanehisa, Yoko Sato, Miho Furumichi, Kanae Morishima, and Mao Tanabe. New approach for understanding genome variations in KEGG. Nucleic Acids Research, 47(D1):D590–D595, 2019. doi: 10.1093/nar/gky962