GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction
Pith reviewed 2026-05-15 02:07 UTC · model grok-4.3
The pith
A 4B-parameter agent fuses frozen genome embeddings into an LLM and uses a counterfactual reward to match larger models on microbial life-boundary prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors formulate microbial life-boundary prediction as a unified genome-to-physiology task and solve it with a genome-conditioned, tool-augmented LLM agent. The agent receives frozen LucaOne genome embeddings via token fusion, reasons with RAG and GEM perturbation tools, and is optimized through gene-text alignment, agentic supervised fine-tuning, and GRPO with a counterfactual gene-grounding reward that reinforces the policy only when the authentic genome embedding improves correct-token generation relative to a zero-gene ablation. On the curated benchmark the resulting 4B-parameter agent matches or surpasses much larger LLMs, and component ablations confirm that each added element contributes.
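The reward logic can be sketched in a few lines. This is an illustrative reading, not the authors' code: `logp_real` and `logp_null` stand in for the policy's summed log-probabilities of the gold answer tokens under the authentic and the zero-gene genome conditioning, and `task_reward` is a hypothetical base correctness reward.

```python
def counterfactual_gene_grounding_reward(logp_real: float, logp_null: float,
                                         task_reward: float) -> float:
    """Sketch of a counterfactual gene-grounding reward (illustrative only).

    The policy is credited only when conditioning on the authentic genome
    embedding (logp_real) raises the log-probability of the correct answer
    tokens relative to a zero-gene ablation (logp_null); otherwise the
    reward is withheld.
    """
    grounding_gain = logp_real - logp_null
    if grounding_gain > 0.0:
        return task_reward  # genome embedding helped: pass the reward through
    return 0.0              # no causal improvement: withhold the reward

# toy check: real embedding helps -> reward passes through
assert counterfactual_gene_grounding_reward(-1.2, -2.5, 1.0) == 1.0
# real embedding hurts -> reward zeroed
assert counterfactual_gene_grounding_reward(-3.0, -2.5, 1.0) == 0.0
```

The gating (rather than a continuous bonus) matches the review's phrasing that the policy is rewarded "only when" the authentic embedding improves generation; a real implementation would compute the log-probabilities from two forward passes of the same policy.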
What carries the argument
Genome-conditioned tool-augmented LLM agent that fuses frozen LucaOne embeddings via lightweight token fusion and optimizes with GRPO under a counterfactual gene-grounding reward.
If this is right
- Microbial strains can be screened for viability ranges, optima, substrate utilization, and morphology without exhaustive laboratory experiments.
- Ablation results establish that genome-token fusion, dynamic tool calling, and the counterfactual reward each contribute independent performance gains.
- The same three-stage pipeline of alignment, agentic SFT, and counterfactual GRPO can be reused for other genome-to-phenotype mapping tasks.
- Small models equipped with biological foundation-model embeddings can reach parity with much larger general-purpose LLMs on specialized scientific domains.
Where Pith is reading between the lines
- The approach suggests that embedding-based grounding can be applied to other biological prediction problems where sequence data and functional models exist.
- Counterfactual rewards that compare grounded versus ungrounded trajectories may transfer to agent training in other domains requiring factual grounding.
- Scaling the benchmark beyond the current 1,525 strains would test whether the performance advantage persists on rarer or more divergent microbes.
Load-bearing premise
That the frozen LucaOne genome embeddings supply information causally relevant to physiological trait prediction beyond what the text prompt alone provides.
What would settle it
Running the trained agent on the 6,448-instance benchmark with genome embeddings replaced by zero vectors and observing no drop in accuracy, F1, or interval error would falsify the claim that the embeddings carry causally useful signals.
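That settling experiment reduces to a paired comparison of metrics with and without the genome signal. A minimal bookkeeping sketch, where the metric dictionaries are hypothetical evaluation outputs rather than reported numbers:

```python
def embedding_ablation_gap(metrics_real: dict, metrics_zero: dict) -> dict:
    """Per-metric change when genome embeddings are replaced by zero vectors.

    Under the paper's claim, accuracy and F1 should drop (positive gap) and
    interval error should rise (negative gap here); near-zero gaps across the
    board would falsify the causal-usefulness claim.
    """
    return {name: metrics_real[name] - metrics_zero[name] for name in metrics_real}

# illustrative numbers only
gap = embedding_ablation_gap(
    {"accuracy": 0.81, "f1": 0.77, "interval_mae": 0.12},
    {"accuracy": 0.62, "f1": 0.55, "interval_mae": 0.21},
)
assert gap["accuracy"] > 0 and gap["f1"] > 0 and gap["interval_mae"] < 0
```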
Original abstract
Characterizing the physiological life boundaries of microbial strains, including viable temperature, pH, salinity, substrate utilization, and morphology, is central to biotechnology and ecology, yet traditionally requires exhaustive in vitro screening. Existing computational approaches either treat physiological traits as isolated supervised targets or repurpose biological foundation models as static encoders, leaving the genotype-to-physiology gap largely unbridged. We formulate microbial life-boundary prediction as a unified genome-to-physiology task and address it with a genome-conditioned, tool-augmented LLM agent. To support this task, we curate a strain-centric benchmark from IJSEM, NCBI, and BacDive covering 1,525 strains and 6,448 instances across viability intervals, environmental optima, substrate utilization, categorical traits, and morphology. Architecturally, the agent injects frozen LucaOne genome embeddings into a Qwen backbone via lightweight token fusion, and reasons over a similarity-based RAG module and a Genome-scale Metabolic Model (GEM) perturbation tool. We optimize the agent through a three-stage pipeline of gene-text alignment, agentic SFT on distilled trajectories, and GRPO with a novel counterfactual gene-grounding reward that reinforces the policy only when the authentic genome embedding causally improves correct-token generation relative to a zero-gene ablation. The resulting 4B-parameter agent matches or surpasses substantially larger frontier LLMs, with ablations confirming that genome-token fusion, dynamic tool use, and the counterfactual reward each yield distinct, significant gains.
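The abstract leaves the "lightweight token fusion" unspecified. One common pattern for injecting frozen encoder outputs into an LLM (as in BLIP-2-style adapters) is a small trained projection whose outputs are prepended as soft tokens; a minimal sketch under that assumption, with toy dimensions that are not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

GENE_DIM, LLM_DIM, N_GENE_TOKENS = 16, 32, 4  # toy sizes, not the paper's

# frozen genome embedding from the upstream encoder (e.g. LucaOne); random here
gene_embedding = rng.normal(size=(N_GENE_TOKENS, GENE_DIM))

# the only trainable fusion parameters in this sketch: one linear projector
W = rng.normal(scale=0.02, size=(GENE_DIM, LLM_DIM))
b = np.zeros(LLM_DIM)

def fuse(gene_emb: np.ndarray, text_token_embs: np.ndarray) -> np.ndarray:
    """Project genome vectors into LLM space and prepend them as soft tokens."""
    gene_tokens = gene_emb @ W + b                 # (n_gene, llm_dim)
    return np.concatenate([gene_tokens, text_token_embs], axis=0)

text_embs = rng.normal(size=(10, LLM_DIM))         # stand-in prompt embeddings
fused = fuse(gene_embedding, text_embs)
assert fused.shape == (N_GENE_TOKENS + 10, LLM_DIM)
```

Keeping the genome encoder frozen and training only the projector is what keeps this fusion "lightweight"; whether the paper uses a single linear map, an MLP, or cross-attention is exactly the reproducibility gap the referee raises below.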
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GGBound, a 4B-parameter genome-grounded LLM agent for predicting microbial life boundaries (viable temperature, pH, salinity, substrate utilization, morphology). It curates a strain-centric benchmark of 1,525 strains and 6,448 instances from IJSEM, NCBI, and BacDive; fuses frozen LucaOne genome embeddings into a Qwen backbone via lightweight token fusion; augments with similarity-based RAG and GEM perturbation tools; and trains via gene-text alignment, agentic SFT, and GRPO using a novel counterfactual gene-grounding reward. The central claim is that the resulting agent matches or surpasses substantially larger frontier LLMs, with ablations confirming distinct gains from genome fusion, dynamic tool use, and the counterfactual reward.
Significance. If the results hold after verification, the work provides a concrete agentic framework that integrates genomic foundation models with reasoning tools to address the genotype-to-physiology gap, which could reduce reliance on exhaustive in vitro screening in biotechnology and ecology. The three-stage pipeline and explicit component ablations are strengths; the counterfactual reward attempts a causal check that goes beyond static encoders. Credit is due for the reproducible benchmark curation intent and the focus on falsifiable ablation gains, though external pre-trained models (LucaOne, Qwen) limit claims of fully internal derivation.
major comments (3)
- [Benchmark construction] (abstract and methods): The curation from IJSEM, NCBI, and BacDive is described at a high level, but no details are given on train/test splits, deduplication, instance construction for viability intervals, or controls for selection effects. This is load-bearing for all performance and ablation claims, as post-hoc multi-database curation risks leakage or non-representative sampling that cannot be verified from the text.
- [GRPO and counterfactual reward] (abstract, §4.3): The reward credits the policy only when the authentic frozen LucaOne embedding improves next-token prediction relative to a zero-gene ablation. The manuscript does not specify the exact construction of the zero-gene input (literal zeros, a learned null token, or a random embedding). This leaves open the possibility that distributional mismatch or altered gradient flow artifactually inflates the apparent causal contribution rather than isolating physiologically relevant information.
- [Evaluation and statistical reporting] (abstract and results): Performance gains and ablation improvements are stated without error bars, statistical tests, confidence intervals, or precise task definitions (e.g., how categorical traits and morphology are scored). This undermines the assertion of 'distinct, significant gains' and prevents assessment of whether the 4B agent truly matches larger models under controlled conditions.
minor comments (2)
- [Architecture] The token-fusion architecture is referred to as 'lightweight' without an equation, diagram, or parameter count in the main text, making the exact integration mechanism difficult to reproduce.
- [References] Ensure complete citations with DOIs or arXiv identifiers for LucaOne, Qwen, IJSEM, BacDive, and any GEM software used.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the manuscript lacked necessary details or rigor, we have revised accordingly to strengthen the work.
Point-by-point responses
Referee: [Benchmark construction] (abstract and methods): The curation from IJSEM, NCBI, and BacDive is described at a high level, but no details are given on train/test splits, deduplication, instance construction for viability intervals, or controls for selection effects. This is load-bearing for all performance and ablation claims, as post-hoc multi-database curation risks leakage or non-representative sampling that cannot be verified from the text.
Authors: We agree that the original description was insufficiently detailed and that this is critical for verifying the claims. In the revised manuscript we have added a dedicated 'Benchmark Curation' subsection in Methods that specifies: an 80/20 strain-stratified train/test split to prevent leakage; deduplication via a 99% 16S rRNA identity threshold plus exact strain-name matching; explicit encoding of viability intervals as textual targets (e.g., 'temperature: 15-42 °C'); and controls for selection bias via phylum-level stratification and cross-validation against an independent hold-out set from a fourth database. The full curation code, raw data identifiers, and split statistics are now provided in the supplementary materials. Revision: yes.
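A strain-level split of the kind the rebuttal describes is easy to make concrete. This is a sketch of the leakage-prevention idea only; the authors' pipeline additionally stratifies by phylum and deduplicates by 16S rRNA identity, which is not reproduced here:

```python
import random
from collections import defaultdict

def strain_stratified_split(instances, test_frac=0.2, seed=0):
    """Split at the *strain* level so no strain spans train and test.

    `instances` are (strain_id, payload) pairs; all instances of a held-out
    strain land in the test set together, preventing leakage across the split.
    """
    by_strain = defaultdict(list)
    for strain_id, payload in instances:
        by_strain[strain_id].append(payload)
    strains = sorted(by_strain)
    random.Random(seed).shuffle(strains)
    n_test = max(1, int(len(strains) * test_frac))
    test_strains = set(strains[:n_test])
    train = [(s, p) for s in strains[n_test:] for p in by_strain[s]]
    test = [(s, p) for s in test_strains for p in by_strain[s]]
    return train, test

data = [(f"strain_{i % 10}", f"trait_{i}") for i in range(40)]
train, test = strain_stratified_split(data)
assert {s for s, _ in train}.isdisjoint({s for s, _ in test})
```

The key property is the disjointness assertion at the end: an instance-level 80/20 split would fail it whenever one strain contributes multiple trait instances.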
Referee: [GRPO and counterfactual reward] (abstract, §4.3): The reward credits the policy only when the authentic frozen LucaOne embedding improves next-token prediction relative to a zero-gene ablation. The manuscript does not specify the exact construction of the zero-gene input (literal zeros, a learned null token, or a random embedding). This leaves open the possibility that distributional mismatch or altered gradient flow artifactually inflates the apparent causal contribution rather than isolating physiologically relevant information.
Authors: We thank the referee for identifying this critical omission. The zero-gene ablation uses a learned null token whose embedding is the mean of all LucaOne embeddings in the training corpus and is held frozen. This design was chosen precisely to reduce distributional shift. We have now explicitly documented the construction in §4.3 and added an ablation comparing the learned null token against both random embeddings and literal-zero vectors; only the learned null token produces the reported reward gains, supporting that the improvement is not an artifact of mismatch. Revision: yes.
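The rebuttal's null-token choice can be sketched directly. Here `embs` are random stand-ins for LucaOne embeddings (not real encoder outputs), and the comparison against literal zeros illustrates why the mean stays closer to the data manifold:

```python
import numpy as np

def build_null_token(train_embeddings: np.ndarray) -> np.ndarray:
    """'Zero-gene' null input as the mean training embedding.

    Compared with literal zeros or a random vector, the mean stays near the
    embedding distribution, reducing the distributional shift that could make
    the counterfactual comparison an artifact.
    """
    return train_embeddings.mean(axis=0)

rng = np.random.default_rng(1)
embs = rng.normal(loc=3.0, size=(100, 8))   # toy stand-ins, offset from origin
null_token = build_null_token(embs)

# the mean sits near the data; literal zeros sit far off-manifold here
dist_null = np.linalg.norm(embs - null_token, axis=1).mean()
dist_zero = np.linalg.norm(embs, axis=1).mean()
assert dist_null < dist_zero
```

The added ablation in the rebuttal (learned null vs. random vs. literal zeros) is exactly the control that distinguishes "the real embedding carries signal" from "the comparison baseline is merely out-of-distribution."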
Referee: [Evaluation and statistical reporting] (abstract and results): Performance gains and ablation improvements are stated without error bars, statistical tests, confidence intervals, or precise task definitions (e.g., how categorical traits and morphology are scored). This undermines the assertion of 'distinct, significant gains' and prevents assessment of whether the 4B agent truly matches larger models under controlled conditions.
Authors: We acknowledge that the statistical reporting was inadequate. The revised Results section now reports: standard deviation error bars from five independent runs with distinct random seeds; 95% confidence intervals for every metric; and paired t-test p-values for all ablation comparisons (all p < 0.05 for the claimed component gains). Task definitions have been clarified: categorical traits and substrate utilization use exact-match accuracy; morphology uses multi-label F1; viability intervals use range-normalized mean absolute error. These additions allow direct assessment of whether the 4B agent matches larger models under controlled conditions. Revision: yes.
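The clarified metrics are simple to pin down. These are minimal sketches under one plausible reading of the definitions (micro-averaged multi-label F1; endpoint-wise, width-normalized interval error), not the authors' exact scoring code:

```python
import math
from statistics import mean

def exact_match_accuracy(preds, golds):
    """Fraction of predictions exactly equal to the gold label."""
    return mean(p == g for p, g in zip(preds, golds))

def multilabel_f1(pred_labels, gold_labels):
    """Micro-averaged multi-label F1 over sets of predicted labels."""
    tp = sum(len(set(p) & set(g)) for p, g in zip(pred_labels, gold_labels))
    fp = sum(len(set(p) - set(g)) for p, g in zip(pred_labels, gold_labels))
    fn = sum(len(set(g) - set(p)) for p, g in zip(pred_labels, gold_labels))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def range_normalized_mae(pred_intervals, gold_intervals):
    """Mean absolute endpoint error divided by the gold interval's width."""
    errs = []
    for (plo, phi), (glo, ghi) in zip(pred_intervals, gold_intervals):
        width = ghi - glo
        errs.append((abs(plo - glo) + abs(phi - ghi)) / (2 * width))
    return mean(errs)

assert exact_match_accuracy(["rod", "coccus"], ["rod", "spiral"]) == 0.5
assert math.isclose(multilabel_f1([{"glucose"}], [{"glucose", "lactose"}]), 2 / 3)
assert math.isclose(range_normalized_mae([(10, 40)], [(15, 42)]), (5 + 2) / (2 * 27))
```

Fixing the scoring functions in code like this is what makes "matches larger models" auditable: two systems can then be compared metric-by-metric with paired tests over the same held-out instances.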
Circularity Check
No significant circularity; claims rest on external benchmarks and frozen pre-trained models
Full rationale
The paper's derivation proceeds from curating an independent benchmark (IJSEM/NCBI/BacDive, 1,525 strains), injecting frozen LucaOne embeddings into a Qwen backbone via token fusion, and training via gene-text alignment, agentic SFT, and GRPO with a counterfactual reward defined as improvement in next-token prediction over a zero-gene control. Final performance and ablation gains (genome fusion, tool use, counterfactual reward) are measured on held-out physiological trait prediction tasks. No equation, reward definition, or ablation reduces the reported accuracy or gains to the inputs by construction; the reward is a standard RL shaping signal whose effect is verified against an external test distribution rather than being tautological. The method relies on public databases and externally pre-trained models (LucaOne, Qwen) with no load-bearing self-citation or self-definitional loop.
Axiom & Free-Parameter Ledger
free parameters (2)
- token fusion parameters
- GRPO optimization hyperparameters
axioms (2)
- domain assumption: Frozen LucaOne embeddings capture features relevant to physiological traits
- domain assumption: Benchmark data from IJSEM, NCBI, and BacDive accurately reflect true microbial life boundaries
Reference graph
Works this paper leans on
- [1] Lynn J Rothschild and Rocco L Mancinelli. Life in extreme environments. Nature, 409(6823):1092–1101, 2001.
- [2] Jesse P Harrison, Nicolas Gheeraert, Dmitry Tsigelnitskiy, and Charles S Cockell. The limits for life under multiple extremes. Trends in Microbiology, 21(4):204–212, 2013.
- [3] Nancy Merino, Heidi S Aronson, Diana P Bojanova, Jayme Feyhl-Buska, Michael L Wong, Shu Zhang, and Donato Giovannelli. Living at the extremes: extremophiles and the limits of life in a planetary context. Frontiers in Microbiology, 10:447668, 2019.
- [4] D Nichols, N Cahoon, EM Trakhtenberg, L Pham, A Mehta, A Belanger, Tanya Kanigan, Kim Lewis, and SS Epstein. Use of ichip for high-throughput in situ cultivation of "uncultivable" microbial species. Applied and Environmental Microbiology, 76(8):2445–2450, 2010.
- [5] Jean-Christophe Lagier, Grégory Dubourg, Matthieu Million, Frédéric Cadoret, Melhem Bilen, Florence Fenollar, Anthony Levasseur, Jean-Marc Rolain, Pierre-Edouard Fournier, and Didier Raoult. Culturing the human microbiota and culturomics. Nature Reviews Microbiology, 16(9):540–550, 2018.
- [6] Eric W Sayers, Jeffrey Beck, Evan E Bolton, Devon Bourexis, James R Brister, Kathi Canese, Donald C Comeau, Kathryn Funk, Sunghwan Kim, William Klimke, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 49(D1):D10–D17, 2021.
- [7] Aaron Weimann, Kyra Mooren, Jeremy Frank, Phillip B Pope, Andreas Bremges, and Alice C McHardy. From genomes to phenotypes: Traitar, the microbial trait analyzer. mSystems, 1(6):10–1128, 2016.
- [8] Erki Aun, Age Brauer, Veljo Kisand, Tanel Tenson, and Maido Remm. A k-mer-based method for the identification of phenotype-associated genomic biomarkers and predicting phenotypes of sequenced bacteria. PLoS Computational Biology, 14(10):e1006434, 2018.
- [9] Signe T Karlsen, Martin H Rau, Benjamín J Sánchez, Kristian Jensen, and Ahmad A Zeidan. From genotype to phenotype: computational approaches for inferring microbial traits relevant to the food industry. FEMS Microbiology Reviews, 47(4):fuad030, 2023.
- [10] Julia Koblitz, Lorenz Christian Reimer, Rüdiger Pukall, and Jörg Overmann. Predicting bacterial phenotypic traits through improved machine learning using high-quality, curated datasets. Communications Biology, 8(1):897, 2025.
- [11] Alexandre Drouin, Gaël Letarte, Frédéric Raymond, Mario Marchand, Jacques Corbeil, and François Laviolette. Interpretable genotype-to-phenotype classifiers with performance guarantees. Scientific Reports, 9(1):4071, 2019.
- [12] Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C Lawrence Zitnick, Jerry Ma, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15):e2016239118, 2021.
- [13] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
- [14] Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V Davuluri. DNABERT: pre-trained bidirectional encoder representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15):2112–2120, 2021.
- [15] Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, and Han Liu. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006, 2023.
- [16] Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P De Almeida, Hassan Sirelkhatim, et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods, 22(2):287–297, 2025.
- [17] Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Michael Wornow, Callum Birch-Sykes, Stefano Massaroli, Aman Patel, Clayton Rabideau, Yoshua Bengio, et al. HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. Advances in Neural Information Processing Systems, 36:43177–43201, 2023.
- [18] Eric Nguyen, Michael Poli, Matthew G Durrant, Brian Kang, Dhruva Katrekar, David B Li, Liam J Bartie, Armin W Thomas, Samuel H King, Garyk Brixi, et al. Sequence modeling and design from molecular to genome scale with Evo. Science, 386(6723):eado9336, 2024.
- [19] Garyk Brixi, Matthew G Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwanggyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A Gonzalez, et al. Genome modelling and design across all domains of life with Evo 2. Nature, pages 1–13, 2026.
- [20] Yong He, Pan Fang, Yongtao Shan, Yuanfei Pan, Yanhong Wei, Yichang Chen, Yihao Chen, Yi Liu, Zhenyu Zeng, Zhan Zhou, et al. Generalized biological foundation model with unified nucleic acid and protein language. Nature Machine Intelligence, 7(6):942–953, 2025.
- [21] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [22] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
- [23] Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871–876, 2021.
- [24] Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7112–7127, 2021.
- [25] Maxim Zvyagin, Alexander Brace, Kyle Hippe, Yuntian Deng, Bin Zhang, Cindy Orozco Bohorquez, Austin Clyde, Bharat Kale, Danilo Perez-Rivera, Heng Ma, et al. GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. The International Journal of High Performance Computing Applications, 37(6):683–705, 2023.
- [26] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- [27] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [28] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.
- [29] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.
- [30] Yujia Wang, Jihong Guan, Wengen Li, Shuigeng Zhou, and Xuhong Wang. BioBridge: Bridging proteins and language for enhanced biological reasoning with LLMs. In 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 70–75. IEEE, 2025.
- [31] Isabel Schober, Julia Koblitz, Joaquim Sardà Carbasse, Christian Ebeling, Marvin Leon Schmidt, Adam Podstawka, Rohit Gupta, Vinodh Ilangovan, Javad Chamanara, Jörg Overmann, et al. BacDive in 2025: the core database for prokaryotic strain data. Nucleic Acids Research, 53(D1):D748–D756, 2025.
- [32] Ines Thiele and Bernhard Ø Palsson. A protocol for generating a high-quality genome-scale metabolic reconstruction. Nature Protocols, 5(1):93–121, 2010.
- [33] Laurent Heirendt, Sylvain Arreckx, Thomas Pfau, Sebastián N Mendoza, Anne Richelle, Almut Heinken, Hulda S Haraldsdóttir, Jacek Wachowiak, Sarah M Keating, Vanja Vlasov, et al. Creation and analysis of biochemical constraint-based models using the COBRA Toolbox v3.0. Nature Protocols, 14(3):639–702, 2019.
- [34] Microbiology Society. International Journal of Systematic and Evolutionary Microbiology. Official journal of record for novel prokaryotic taxa. URL https://www.microbiologyresearch.org/content/journal/ijsem.
- [35] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
- [36] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [37] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5.
- [38] Daniel Machado, Sergej Andrejev, Melanie Tramontano, and Kiran Raosaheb Patil. Fast automated reconstruction of genome-scale metabolic models for microbial species and communities. Nucleic Acids Research, 46(15):7542–7553, 2018.
- [39] Adam M Feist, Johannes CM Scholten, Bernhard Ø Palsson, Fred J Brockman, and Trey Ideker. Modeling methanogenesis with a genome-scale metabolic reconstruction of Methanosarcina barkeri. Molecular Systems Biology, 2, 2006.
- [40] Zachary A King, Justin Lu, Andreas Dräger, Philip Miller, Stephen Federowicz, Joshua A Lerman, Ali Ebrahim, Bernhard O Palsson, and Nathan E Lewis. BiGG Models: A platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Research, 44(D1):D515–D522, 2016.
- [41] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [42] Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. arXiv preprint arXiv:2508.06471, 2025.
- [43] Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.
- [44] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics.
- [45] Hugging Face. Open R1: A fully open reproduction of DeepSeek-R1, January 2025. URL https://github.com/huggingface/open-r1.
- [46] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl.