SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Binze Hu; Huayu Sha; Jiazheng Zhang; Jingqi Tong; Jixuan Huang; Junlin Shang; Lei Bai; Ming Zhang; Qiyuan Peng; Qi Zhang

arxiv: 2602.12984 · v2 · pith:OQYRLUQUnew · submitted 2026-02-13 · 💻 cs.CL

Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

classification 💻 cs.CL

keywords: scientific, agents, tool-use, capabilities, domain-specific, evaluation, models, sciagentgym

Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models still struggle with complex scientific tool-use, and their performance degrades substantially as interaction horizons extend. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AI scientists produce results without reasoning scientifically
cs.AI 2026-04 conditional novelty 7.0

LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.