GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction
Pith reviewed 2026-05-15 02:07 UTC · model grok-4.3
The pith
A 4B-parameter agent fuses frozen genome embeddings into an LLM and uses a counterfactual reward to match larger models on microbial life-boundary prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors formulate microbial life-boundary prediction as a unified genome-to-physiology task and solve it with a genome-conditioned, tool-augmented LLM agent. The agent receives frozen LucaOne genome embeddings via token fusion, reasons with RAG and GEM perturbation tools, and is optimized through gene-text alignment, agentic supervised fine-tuning, and GRPO with a counterfactual gene-grounding reward that reinforces the policy only when the authentic genome embedding improves correct-token generation relative to a zero-gene ablation. On the curated benchmark the resulting 4B-parameter agent matches or surpasses much larger LLMs, and component ablations confirm that each added element contributes.
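The reward logic can be sketched in a few lines. This is an illustrative reading, not the authors' code: `logp_real` and `logp_null` stand in for the policy's summed log-probabilities of the gold answer tokens under the authentic and the zero-gene genome conditioning, and `task_reward` is a hypothetical base correctness reward.

```python
def counterfactual_gene_grounding_reward(logp_real: float, logp_null: float,
                                         task_reward: float) -> float:
    """Sketch of a counterfactual gene-grounding reward (illustrative only).

    The policy is credited only when conditioning on the authentic genome
    embedding (logp_real) raises the log-probability of the correct answer
    tokens relative to a zero-gene ablation (logp_null); otherwise the
    reward is withheld.
    """
    grounding_gain = logp_real - logp_null
    if grounding_gain > 0.0:
        return task_reward  # genome embedding helped: pass the reward through
    return 0.0              # no causal improvement: withhold the reward

# toy check: real embedding helps -> reward passes through
assert counterfactual_gene_grounding_reward(-1.2, -2.5, 1.0) == 1.0
# real embedding hurts -> reward zeroed
assert counterfactual_gene_grounding_reward(-3.0, -2.5, 1.0) == 0.0
```

The gating (rather than a continuous bonus) matches the review's phrasing that the policy is rewarded "only when" the authentic embedding improves generation; a real implementation would compute the log-probabilities from two forward passes of the same policy.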
What carries the argument
Genome-conditioned tool-augmented LLM agent that fuses frozen LucaOne embeddings via lightweight token fusion and optimizes with GRPO under a counterfactual gene-grounding reward.
If this is right
- Microbial strains can be screened for viability ranges, optima, substrate utilization, and morphology without exhaustive laboratory experiments.
- Ablation results establish that genome-token fusion, dynamic tool calling, and the counterfactual reward each contribute independent performance gains.
- The same three-stage pipeline of alignment, agentic SFT, and counterfactual GRPO can be reused for other genome-to-phenotype mapping tasks.
- Small models equipped with biological foundation-model embeddings can reach parity with much larger general-purpose LLMs on specialized scientific domains.
Where Pith is reading between the lines
- The approach suggests that embedding-based grounding can be applied to other biological prediction problems where sequence data and functional models exist.
- Counterfactual rewards that compare grounded versus ungrounded trajectories may transfer to agent training in other domains requiring factual grounding.
- Scaling the benchmark beyond the current 1,525 strains would test whether the performance advantage persists on rarer or more divergent microbes.
Load-bearing premise
That the frozen LucaOne genome embeddings supply information causally relevant to physiological trait prediction beyond what the text prompt alone provides.
What would settle it
Running the trained agent on the 6,448-instance benchmark with genome embeddings replaced by zero vectors and observing no drop in accuracy, F1, or interval error would falsify the claim that the embeddings carry causally useful signals.
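That settling experiment reduces to a paired comparison of metrics with and without the genome signal. A minimal bookkeeping sketch, where the metric dictionaries are hypothetical evaluation outputs rather than reported numbers:

```python
def embedding_ablation_gap(metrics_real: dict, metrics_zero: dict) -> dict:
    """Per-metric change when genome embeddings are replaced by zero vectors.

    Under the paper's claim, accuracy and F1 should drop (positive gap) and
    interval error should rise (negative gap here); near-zero gaps across the
    board would falsify the causal-usefulness claim.
    """
    return {name: metrics_real[name] - metrics_zero[name] for name in metrics_real}

# illustrative numbers only
gap = embedding_ablation_gap(
    {"accuracy": 0.81, "f1": 0.77, "interval_mae": 0.12},
    {"accuracy": 0.62, "f1": 0.55, "interval_mae": 0.21},
)
assert gap["accuracy"] > 0 and gap["f1"] > 0 and gap["interval_mae"] < 0
```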
Original abstract
Characterizing the physiological life boundaries of microbial strains, including viable temperature, pH, salinity, substrate utilization, and morphology, is central to biotechnology and ecology, yet traditionally requires exhaustive in vitro screening. Existing computational approaches either treat physiological traits as isolated supervised targets or repurpose biological foundation models as static encoders, leaving the genotype-to-physiology gap largely unbridged. We formulate microbial life-boundary prediction as a unified genome-to-physiology task and address it with a genome-conditioned, tool-augmented LLM agent. To support this task, we curate a strain-centric benchmark from IJSEM, NCBI, and BacDive covering 1,525 strains and 6,448 instances across viability intervals, environmental optima, substrate utilization, categorical traits, and morphology. Architecturally, the agent injects frozen LucaOne genome embeddings into a Qwen backbone via lightweight token fusion, and reasons over a similarity-based RAG module and a Genome-scale Metabolic Model (GEM) perturbation tool. We optimize the agent through a three-stage pipeline of gene-text alignment, agentic SFT on distilled trajectories, and GRPO with a novel counterfactual gene-grounding reward that reinforces the policy only when the authentic genome embedding causally improves correct-token generation relative to a zero-gene ablation. The resulting 4B-parameter agent matches or surpasses substantially larger frontier LLMs, with ablations confirming that genome-token fusion, dynamic tool use, and the counterfactual reward each yield distinct, significant gains.
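The abstract leaves the "lightweight token fusion" unspecified. One common pattern for injecting frozen encoder outputs into an LLM (as in BLIP-2-style adapters) is a small trained projection whose outputs are prepended as soft tokens; a minimal sketch under that assumption, with toy dimensions that are not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

GENE_DIM, LLM_DIM, N_GENE_TOKENS = 16, 32, 4  # toy sizes, not the paper's

# frozen genome embedding from the upstream encoder (e.g. LucaOne); random here
gene_embedding = rng.normal(size=(N_GENE_TOKENS, GENE_DIM))

# the only trainable fusion parameters in this sketch: one linear projector
W = rng.normal(scale=0.02, size=(GENE_DIM, LLM_DIM))
b = np.zeros(LLM_DIM)

def fuse(gene_emb: np.ndarray, text_token_embs: np.ndarray) -> np.ndarray:
    """Project genome vectors into LLM space and prepend them as soft tokens."""
    gene_tokens = gene_emb @ W + b                 # (n_gene, llm_dim)
    return np.concatenate([gene_tokens, text_token_embs], axis=0)

text_embs = rng.normal(size=(10, LLM_DIM))         # stand-in prompt embeddings
fused = fuse(gene_embedding, text_embs)
assert fused.shape == (N_GENE_TOKENS + 10, LLM_DIM)
```

Keeping the genome encoder frozen and training only the projector is what keeps this fusion "lightweight"; whether the paper uses a single linear map, an MLP, or cross-attention is exactly the reproducibility gap the referee raises below.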
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GGBound, a 4B-parameter genome-grounded LLM agent for predicting microbial life boundaries (viable temperature, pH, salinity, substrate utilization, morphology). It curates a strain-centric benchmark of 1,525 strains and 6,448 instances from IJSEM, NCBI, and BacDive; fuses frozen LucaOne genome embeddings into a Qwen backbone via lightweight token fusion; augments with similarity-based RAG and GEM perturbation tools; and trains via gene-text alignment, agentic SFT, and GRPO using a novel counterfactual gene-grounding reward. The central claim is that the resulting agent matches or surpasses substantially larger frontier LLMs, with ablations confirming distinct gains from genome fusion, dynamic tool use, and the counterfactual reward.
Significance. If the results hold after verification, the work provides a concrete agentic framework that integrates genomic foundation models with reasoning tools to address the genotype-to-physiology gap, which could reduce reliance on exhaustive in vitro screening in biotechnology and ecology. The three-stage pipeline and explicit component ablations are strengths; the counterfactual reward attempts a causal check that goes beyond static encoders. Credit is due for the reproducible benchmark curation intent and the focus on falsifiable ablation gains, though external pre-trained models (LucaOne, Qwen) limit claims of fully internal derivation.
major comments (3)
- [Benchmark construction] (abstract and methods): The curation from IJSEM, NCBI, and BacDive is described at a high level, but no details are given on train/test splits, deduplication, instance construction for viability intervals, or controls for selection effects. This is load-bearing for all performance and ablation claims, as post-hoc multi-database curation risks leakage or non-representative sampling that cannot be verified from the text.
- [GRPO and counterfactual reward] (abstract, §4.3): The reward credits the policy only when the authentic frozen LucaOne embedding improves next-token prediction relative to a zero-gene ablation. The manuscript does not specify the exact construction of the zero-gene input (literal zeros, a learned null token, or a random embedding). This leaves open the possibility that distributional mismatch or altered gradient flow artifactually inflates the apparent causal contribution rather than isolating physiologically relevant information.
- [Evaluation and statistical reporting] (abstract and results): Performance gains and ablation improvements are stated without error bars, statistical tests, confidence intervals, or precise task definitions (e.g., how categorical traits and morphology are scored). This undermines the assertion of 'distinct, significant gains' and prevents assessment of whether the 4B agent truly matches larger models under controlled conditions.
minor comments (2)
- [Architecture] The token-fusion architecture is referred to as 'lightweight' without an equation, diagram, or parameter count in the main text, making the exact integration mechanism difficult to reproduce.
- [References] Ensure complete citations with DOIs or arXiv identifiers for LucaOne, Qwen, IJSEM, BacDive, and any GEM software used.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the manuscript lacked necessary details or rigor, we have revised accordingly to strengthen the work.
Point-by-point responses
Referee: [Benchmark construction] (abstract and methods): The curation from IJSEM, NCBI, and BacDive is described at a high level, but no details are given on train/test splits, deduplication, instance construction for viability intervals, or controls for selection effects. This is load-bearing for all performance and ablation claims, as post-hoc multi-database curation risks leakage or non-representative sampling that cannot be verified from the text.
Authors: We agree that the original description was insufficiently detailed and that this is critical for verifying the claims. In the revised manuscript we have added a dedicated 'Benchmark Curation' subsection in Methods that specifies: an 80/20 strain-stratified train/test split to prevent leakage; deduplication via a 99% 16S rRNA identity threshold plus exact strain-name matching; explicit encoding of viability intervals as textual targets (e.g., 'temperature: 15-42 °C'); and controls for selection bias via phylum-level stratification and cross-validation against an independent hold-out set from a fourth database. The full curation code, raw data identifiers, and split statistics are now provided in the supplementary materials. Revision: yes.
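A strain-level split of the kind the rebuttal describes is easy to make concrete. This is a sketch of the leakage-prevention idea only; the authors' pipeline additionally stratifies by phylum and deduplicates by 16S rRNA identity, which is not reproduced here:

```python
import random
from collections import defaultdict

def strain_stratified_split(instances, test_frac=0.2, seed=0):
    """Split at the *strain* level so no strain spans train and test.

    `instances` are (strain_id, payload) pairs; all instances of a held-out
    strain land in the test set together, preventing leakage across the split.
    """
    by_strain = defaultdict(list)
    for strain_id, payload in instances:
        by_strain[strain_id].append(payload)
    strains = sorted(by_strain)
    random.Random(seed).shuffle(strains)
    n_test = max(1, int(len(strains) * test_frac))
    test_strains = set(strains[:n_test])
    train = [(s, p) for s in strains[n_test:] for p in by_strain[s]]
    test = [(s, p) for s in test_strains for p in by_strain[s]]
    return train, test

data = [(f"strain_{i % 10}", f"trait_{i}") for i in range(40)]
train, test = strain_stratified_split(data)
assert {s for s, _ in train}.isdisjoint({s for s, _ in test})
```

The key property is the disjointness assertion at the end: an instance-level 80/20 split would fail it whenever one strain contributes multiple trait instances.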
Referee: [GRPO and counterfactual reward] (abstract, §4.3): The reward credits the policy only when the authentic frozen LucaOne embedding improves next-token prediction relative to a zero-gene ablation. The manuscript does not specify the exact construction of the zero-gene input (literal zeros, a learned null token, or a random embedding). This leaves open the possibility that distributional mismatch or altered gradient flow artifactually inflates the apparent causal contribution rather than isolating physiologically relevant information.
Authors: We thank the referee for identifying this critical omission. The zero-gene ablation uses a learned null token whose embedding is the mean of all LucaOne embeddings in the training corpus and is held frozen. This design was chosen precisely to reduce distributional shift. We have now explicitly documented the construction in §4.3 and added an ablation comparing the learned null token against both random embeddings and literal-zero vectors; only the learned null token produces the reported reward gains, supporting that the improvement is not an artifact of mismatch. Revision: yes.
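The rebuttal's null-token choice can be sketched directly. Here `embs` are random stand-ins for LucaOne embeddings (not real encoder outputs), and the comparison against literal zeros illustrates why the mean stays closer to the data manifold:

```python
import numpy as np

def build_null_token(train_embeddings: np.ndarray) -> np.ndarray:
    """'Zero-gene' null input as the mean training embedding.

    Compared with literal zeros or a random vector, the mean stays near the
    embedding distribution, reducing the distributional shift that could make
    the counterfactual comparison an artifact.
    """
    return train_embeddings.mean(axis=0)

rng = np.random.default_rng(1)
embs = rng.normal(loc=3.0, size=(100, 8))   # toy stand-ins, offset from origin
null_token = build_null_token(embs)

# the mean sits near the data; literal zeros sit far off-manifold here
dist_null = np.linalg.norm(embs - null_token, axis=1).mean()
dist_zero = np.linalg.norm(embs, axis=1).mean()
assert dist_null < dist_zero
```

The added ablation in the rebuttal (learned null vs. random vs. literal zeros) is exactly the control that distinguishes "the real embedding carries signal" from "the comparison baseline is merely out-of-distribution."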
Referee: [Evaluation and statistical reporting] (abstract and results): Performance gains and ablation improvements are stated without error bars, statistical tests, confidence intervals, or precise task definitions (e.g., how categorical traits and morphology are scored). This undermines the assertion of 'distinct, significant gains' and prevents assessment of whether the 4B agent truly matches larger models under controlled conditions.
Authors: We acknowledge that the statistical reporting was inadequate. The revised Results section now reports: standard deviation error bars from five independent runs with distinct random seeds; 95% confidence intervals for every metric; and paired t-test p-values for all ablation comparisons (all p < 0.05 for the claimed component gains). Task definitions have been clarified: categorical traits and substrate utilization use exact-match accuracy; morphology uses multi-label F1; viability intervals use range-normalized mean absolute error. These additions allow direct assessment of whether the 4B agent matches larger models under controlled conditions. Revision: yes.
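The clarified metrics are simple to pin down. These are minimal sketches under one plausible reading of the definitions (micro-averaged multi-label F1; endpoint-wise, width-normalized interval error), not the authors' exact scoring code:

```python
import math
from statistics import mean

def exact_match_accuracy(preds, golds):
    """Fraction of predictions exactly equal to the gold label."""
    return mean(p == g for p, g in zip(preds, golds))

def multilabel_f1(pred_labels, gold_labels):
    """Micro-averaged multi-label F1 over sets of predicted labels."""
    tp = sum(len(set(p) & set(g)) for p, g in zip(pred_labels, gold_labels))
    fp = sum(len(set(p) - set(g)) for p, g in zip(pred_labels, gold_labels))
    fn = sum(len(set(g) - set(p)) for p, g in zip(pred_labels, gold_labels))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def range_normalized_mae(pred_intervals, gold_intervals):
    """Mean absolute endpoint error divided by the gold interval's width."""
    errs = []
    for (plo, phi), (glo, ghi) in zip(pred_intervals, gold_intervals):
        width = ghi - glo
        errs.append((abs(plo - glo) + abs(phi - ghi)) / (2 * width))
    return mean(errs)

assert exact_match_accuracy(["rod", "coccus"], ["rod", "spiral"]) == 0.5
assert math.isclose(multilabel_f1([{"glucose"}], [{"glucose", "lactose"}]), 2 / 3)
assert math.isclose(range_normalized_mae([(10, 40)], [(15, 42)]), (5 + 2) / (2 * 27))
```

Fixing the scoring functions in code like this is what makes "matches larger models" auditable: two systems can then be compared metric-by-metric with paired tests over the same held-out instances.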
Circularity Check
No significant circularity; claims rest on external benchmarks and frozen pre-trained models
Full rationale
The paper's derivation proceeds from curating an independent benchmark (IJSEM/NCBI/BacDive, 1,525 strains), injecting frozen LucaOne embeddings into a Qwen backbone via token fusion, and training via gene-text alignment, agentic SFT, and GRPO with a counterfactual reward defined as improvement in next-token prediction over a zero-gene control. Final performance and ablation gains (genome fusion, tool use, counterfactual reward) are measured on held-out physiological trait prediction tasks. No equation, reward definition, or ablation reduces the reported accuracy or gains to the inputs by construction; the reward is a standard RL shaping signal whose effect is verified against an external test distribution rather than being tautological. The method relies on public databases and externally pre-trained models (LucaOne, Qwen) with no load-bearing self-citation or self-definitional loop.
Axiom & Free-Parameter Ledger
free parameters (2)
- token fusion parameters
- GRPO optimization hyperparameters
axioms (2)
- domain assumption: Frozen LucaOne embeddings capture features relevant to physiological traits
- domain assumption: Benchmark data from IJSEM, NCBI, and BacDive accurately reflect true microbial life boundaries
Reference graph
Works this paper leans on
- [1] Lynn J Rothschild and Rocco L Mancinelli. Life in extreme environments. Nature, 409(6823):1092–1101, 2001.
- [2] Jesse P Harrison, Nicolas Gheeraert, Dmitry Tsigelnitskiy, and Charles S Cockell. The limits for life under multiple extremes. Trends in Microbiology, 21(4):204–212, 2013.
- [3] Nancy Merino, Heidi S Aronson, Diana P Bojanova, Jayme Feyhl-Buska, Michael L Wong, Shu Zhang, and Donato Giovannelli. Living at the extremes: extremophiles and the limits of life in a planetary context. Frontiers in Microbiology, 10:447668, 2019.
- [4] D Nichols, N Cahoon, EM Trakhtenberg, L Pham, A Mehta, A Belanger, Tanya Kanigan, Kim Lewis, and SS Epstein. Use of ichip for high-throughput in situ cultivation of "uncultivable" microbial species. Applied and Environmental Microbiology, 76(8):2445–2450, 2010.
- [5] Jean-Christophe Lagier, Grégory Dubourg, Matthieu Million, Frédéric Cadoret, Melhem Bilen, Florence Fenollar, Anthony Levasseur, Jean-Marc Rolain, Pierre-Edouard Fournier, and Didier Raoult. Culturing the human microbiota and culturomics. Nature Reviews Microbiology, 16(9):540–550, 2018.
- [6] Eric W Sayers, Jeffrey Beck, Evan E Bolton, Devon Bourexis, James R Brister, Kathi Canese, Donald C Comeau, Kathryn Funk, Sunghwan Kim, William Klimke, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 49(D1):D10–D17, 2021.
- [7] Aaron Weimann, Kyra Mooren, Jeremy Frank, Phillip B Pope, Andreas Bremges, and Alice C McHardy. From genomes to phenotypes: Traitar, the microbial trait analyzer. mSystems, 1(6):10–1128, 2016.
- [8] Erki Aun, Age Brauer, Veljo Kisand, Tanel Tenson, and Maido Remm. A k-mer-based method for the identification of phenotype-associated genomic biomarkers and predicting phenotypes of sequenced bacteria. PLoS Computational Biology, 14(10):e1006434, 2018.
- [9] Signe T Karlsen, Martin H Rau, Benjamín J Sánchez, Kristian Jensen, and Ahmad A Zeidan. From genotype to phenotype: computational approaches for inferring microbial traits relevant to the food industry. FEMS Microbiology Reviews, 47(4):fuad030, 2023.
- [10] Julia Koblitz, Lorenz Christian Reimer, Rüdiger Pukall, and Jörg Overmann. Predicting bacterial phenotypic traits through improved machine learning using high-quality, curated datasets. Communications Biology, 8(1):897, 2025.
- [11] Alexandre Drouin, Gaël Letarte, Frédéric Raymond, Mario Marchand, Jacques Corbeil, and François Laviolette. Interpretable genotype-to-phenotype classifiers with performance guarantees. Scientific Reports, 9(1):4071, 2019.
- [12] Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C Lawrence Zitnick, Jerry Ma, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15):e2016239118, 2021.
- [13] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
- [14] Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V Davuluri. DNABERT: pre-trained bidirectional encoder representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15):2112–2120, 2021.
- [15] Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, and Han Liu. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006, 2023.
- [16] Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P De Almeida, Hassan Sirelkhatim, et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods, 22(2):287–297, 2025.
- [17] Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Michael Wornow, Callum Birch-Sykes, Stefano Massaroli, Aman Patel, Clayton Rabideau, Yoshua Bengio, et al. HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. Advances in Neural Information Processing Systems, 36:43177–43201, 2023.
- [18] Eric Nguyen, Michael Poli, Matthew G Durrant, Brian Kang, Dhruva Katrekar, David B Li, Liam J Bartie, Armin W Thomas, Samuel H King, Garyk Brixi, et al. Sequence modeling and design from molecular to genome scale with Evo. Science, 386(6723):eado9336, 2024.
- [19] Garyk Brixi, Matthew G Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwanggyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A Gonzalez, et al. Genome modelling and design across all domains of life with Evo 2. Nature, pages 1–13, 2026.
- [20] Yong He, Pan Fang, Yongtao Shan, Yuanfei Pan, Yanhong Wei, Yichang Chen, Yihao Chen, Yi Liu, Zhenyu Zeng, Zhan Zhou, et al. Generalized biological foundation model with unified nucleic acid and protein language. Nature Machine Intelligence, 7(6):942–953, 2025.
- [21] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [22] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
- [23] Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871–876, 2021.
- [24] Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7112–7127, 2021.
- [25] Maxim Zvyagin, Alexander Brace, Kyle Hippe, Yuntian Deng, Bin Zhang, Cindy Orozco Bohorquez, Austin Clyde, Bharat Kale, Danilo Perez-Rivera, Heng Ma, et al. GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. The International Journal of High Performance Computing Applications, 37(6):683–705, 2023.
- [26] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- [27] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [28] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.
- [29] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.
- [30] Yujia Wang, Jihong Guan, Wengen Li, Shuigeng Zhou, and Xuhong Wang. BioBridge: Bridging proteins and language for enhanced biological reasoning with LLMs. In 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 70–75. IEEE, 2025.
- [31] Isabel Schober, Julia Koblitz, Joaquim Sardà Carbasse, Christian Ebeling, Marvin Leon Schmidt, Adam Podstawka, Rohit Gupta, Vinodh Ilangovan, Javad Chamanara, Jörg Overmann, et al. BacDive in 2025: the core database for prokaryotic strain data. Nucleic Acids Research, 53(D1):D748–D756, 2025.
- [32] Ines Thiele and Bernhard Ø Palsson. A protocol for generating a high-quality genome-scale metabolic reconstruction. Nature Protocols, 5(1):93–121, 2010.
- [33] Laurent Heirendt, Sylvain Arreckx, Thomas Pfau, Sebastián N Mendoza, Anne Richelle, Almut Heinken, Hulda S Haraldsdóttir, Jacek Wachowiak, Sarah M Keating, Vanja Vlasov, et al. Creation and analysis of biochemical constraint-based models using the COBRA Toolbox v3.0. Nature Protocols, 14(3):639–702, 2019.
- [34] Microbiology Society. International Journal of Systematic and Evolutionary Microbiology. Official journal of record for novel prokaryotic taxa. URL https://www.microbiologyresearch.org/content/journal/ijsem.
- [35] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
- [36] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [37] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5.
- [38] Daniel Machado, Sergej Andrejev, Melanie Tramontano, and Kiran Raosaheb Patil. Fast automated reconstruction of genome-scale metabolic models for microbial species and communities. Nucleic Acids Research, 46(15):7542–7553, 2018.
- [39] Adam M Feist, Johannes CM Scholten, Bernhard Ø Palsson, Fred J Brockman, and Trey Ideker. Modeling methanogenesis with a genome-scale metabolic reconstruction of Methanosarcina barkeri. Molecular Systems Biology, 2, 2006.
- [40] Zachary A King, Justin Lu, Andreas Dräger, Philip Miller, Stephen Federowicz, Joshua A Lerman, Ali Ebrahim, Bernhard O Palsson, and Nathan E Lewis. BiGG Models: A platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Research, 44(D1):D515–D522, 2016.
- [41] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [42] Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. arXiv preprint arXiv:2508.06471, 2025.
- [43] Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.
- [44] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics.
- [45] Hugging Face. Open R1: A fully open reproduction of DeepSeek-R1, January 2025. URL https://github.com/huggingface/open-r1.
- [46] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl.