D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery
Pith reviewed 2026-05-07 07:00 UTC · model grok-4.3
The pith
D3-Gym supplies 565 verifiable environments from real scientific repositories to train language-model agents for data-driven discovery.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
D3-Gym is built by sourcing tasks from real scientific repositories and automatically equipping each one with an executable environment, input data, reference code, and a synthesized evaluation script. Rigorous checks show the scripts agree with human-annotated gold standards 87.5 percent of the time while preserving domain-specific scientific evaluation logic. Training on trajectories generated inside these environments produces consistent gains across Qwen3 model sizes on ScienceAgentBench, raising the 32B model by 7.8 points and shrinking the gap to strong closed-source models.
What carries the argument
D3-Gym, a dataset of 565 tasks, each pairing a real scientific problem with an executable environment and an automatically synthesized evaluation script that checks solution correctness against domain-specific scientific criteria.
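The paper's appendix excerpts specify a concrete contract for these scripts: a top-level eval() function taking no parameters and returning a (result, message) tuple, where result is a pass/fail boolean; the body wrapped in try/except so unexpected errors return (False, f"Error:{e}") rather than crashing; explicit file-existence checks before loading; and a __main__ block that prints the verdict. A minimal sketch of that shape follows; the file names, metric, and threshold are hypothetical stand-ins, not taken from any actual D3-Gym task.

```python
# Minimal sketch of a synthesized evaluation script following the paper's
# stated contract. The top-level function is named eval() (shadowing the
# builtin) because that is the signature the contract specifies.
import os

import pandas as pd


def eval():
    try:
        pred_path = "pred_results/predictions.csv"  # hypothetical path
        gold_path = "gold_results/predictions.csv"  # hypothetical path
        # File-existence checks: fail with a message, never crash.
        for path in (pred_path, gold_path):
            if not os.path.exists(path):
                return (False, f"Missing file: {path}")
        pred = pd.read_csv(pred_path)
        gold = pd.read_csv(gold_path)
        if pred.empty:
            return (False, "Predicted output is empty")
        # Domain-specific check (hypothetical): mean absolute error below
        # a threshold grounded in the task's scientific tolerance.
        mae = (pred["value"] - gold["value"]).abs().mean()
        if mae > 0.05:
            return (False, f"MAE {mae:.4f} exceeds threshold 0.05")
        return (True, f"PASS: MAE {mae:.4f} within tolerance")
    except Exception as e:
        # Unexpected errors yield a failure verdict instead of a crash.
        return (False, f"Error:{e}")


if __name__ == "__main__":
    ok, msg = eval()
    print(ok, msg)
```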
Load-bearing premise
That the automatically synthesized evaluation scripts correctly capture scientific soundness across the full set of tasks, and that success on these tasks transfers to genuine open-ended discovery rather than mere code completion.
What would settle it
Human scientists reviewing a large random sample of model solutions on D3-Gym tasks find agreement with the auto-generated scripts well below 87.5 percent, or models trained on D3-Gym trajectories show no improvement on a fresh collection of real laboratory or field discovery problems outside ScienceAgentBench.
Original abstract
Despite recent progress in language models and agents for scientific data-driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real-world scientific tasks. To fill this gap, we introduce D3-Gym, the first automatically constructed dataset with verifiable environments for scientific Data-Driven Discovery. D3-Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines where (2) each task is equipped with a natural language instruction, an executable environment with pre-installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3-Gym confirms that our evaluation scripts achieve 87.5% agreement with human-annotated gold standards and strong alignment in domain-specific evaluation logic, showing their scientific soundness. Further, training on trajectories sampled from D3-Gym yields consistent and substantial gains across Qwen3 models of varying sizes on ScienceAgentBench, boosting Qwen3-32B by 7.8 absolute points and substantially shrinking the gap with strong proprietary models. All D3-Gym artifacts (environments, creation workflow, trajectories, and models) can be found at https://github.com/OSU-NLP-Group/D3-Gym.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces D3-Gym, the first automatically constructed dataset of 565 verifiable environments for scientific data-driven discovery tasks, sourced from 239 real repositories across four disciplines. Each task provides a natural language instruction, executable environment with dependencies, input data and previews, reference code solution, and an automatically synthesized evaluation script. The authors report that these scripts achieve 87.5% agreement with human gold standards and strong domain-specific alignment. They further show that training on trajectories sampled from D3-Gym produces consistent gains on the external ScienceAgentBench benchmark, including a 7.8 absolute point improvement for Qwen3-32B and a reduced gap to proprietary models. All artifacts are released.
Significance. If the verification signal proves reliable across the full task set, D3-Gym would be a valuable resource for scaling training of agents on real scientific workflows rather than synthetic or non-verifiable code tasks. The empirical transfer result on ScienceAgentBench is concrete and the public release of environments, trajectories, and models supports reproducibility. The work addresses a genuine bottleneck in data-driven discovery agents, though the extent of transfer beyond the benchmark to open-ended discovery requires further evidence.
major comments (3)
- [§4] §4 (Evaluation of Verification Signal): The 87.5% agreement with human-annotated gold standards is presented as evidence of scientific soundness, but the manuscript provides no details on sample size, selection method for the sample of tasks, human inter-annotator agreement, discipline-specific breakdown (e.g., across the four fields), or error analysis of the 12.5% disagreements. Because the training trajectories are generated using these scripts as the reward signal, this information is load-bearing for attributing the reported 7.8-point gain to genuine capability improvement rather than exploitation of script artifacts. A sketch of the agreement statistics in question follows this list.
- [§5] §5 (Training Experiments and Results): No ablation is reported that isolates training on D3-Gym trajectories from training on equivalent volumes of generic code-completion data or non-verifiable scientific tasks. Without such a control, the specificity of the gains to the verifiable scientific environments cannot be confirmed, weakening the central claim that D3-Gym trajectories drive the observed improvements on ScienceAgentBench.
- [§5.2] §5.2 (Benchmark Transfer): The evaluation is conducted solely on ScienceAgentBench; the manuscript does not provide evidence or discussion of whether success on the 565 D3-Gym tasks (which are repository-derived but still structured) transfers to open-ended, non-benchmark scientific discovery workflows. This assumption is central to the broader motivation but remains untested.
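The first major comment asks for the sample size, selection method, and inter-annotator agreement behind the 87.5% figure. Below is a minimal sketch of the two statistics that would address it: raw agreement and chance-corrected agreement (Cohen's kappa), computed here between auto-script verdicts and human judgments on illustrative data. The same formula applies between two human annotators.

```python
# Raw agreement and Cohen's kappa between two verdict sequences.
# The verdict arrays below are illustrative, not the paper's data.
from collections import Counter


def agreement_and_kappa(script_verdicts, human_verdicts):
    assert len(script_verdicts) == len(human_verdicts)
    n = len(script_verdicts)
    # Raw agreement: fraction of tasks where the verdicts coincide.
    p_o = sum(s == h for s, h in zip(script_verdicts, human_verdicts)) / n
    # Chance agreement: expected coincidence under marginal frequencies.
    s_counts, h_counts = Counter(script_verdicts), Counter(human_verdicts)
    p_e = sum(s_counts[k] * h_counts[k] for k in s_counts) / n**2
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return p_o, kappa


# Illustrative: 8 sampled tasks, verdicts True = pass, False = fail.
script = [True, True, False, True, False, True, True, False]
human  = [True, True, False, True, True,  True, True, False]
p_o, kappa = agreement_and_kappa(script, human)
print(f"raw agreement = {p_o:.3f}, Cohen's kappa = {kappa:.3f}")
```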
minor comments (3)
- [Abstract and §5] The abstract and results section should explicitly state the exact baselines, number of runs, and statistical tests used to establish the 7.8-point gain and other improvements.
- [§3] Provide a table or appendix listing the distribution of the 565 tasks across the 239 repositories and four disciplines to allow assessment of coverage.
- [§5] Figures reporting model performance should include error bars or confidence intervals and clarify whether the gains are statistically significant.
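For the last minor comment, a minimal sketch of a percentile-bootstrap confidence interval over per-task benchmark outcomes; the success rate and task count below are illustrative, not the paper's numbers.

```python
# Percentile bootstrap CI for a benchmark success rate. Each entry in
# `outcomes` is a per-task result (1 = solved, 0 = unsolved).
import numpy as np

rng = np.random.default_rng(0)
outcomes = rng.binomial(1, 0.42, size=102)  # illustrative per-task results

# Resample tasks with replacement and record the mean of each resample.
boot_means = np.array([
    rng.choice(outcomes, size=outcomes.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"success rate = {outcomes.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```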
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us clarify the strengths and limitations of D3-Gym. We address each major comment point by point below, indicating where revisions have been made to the manuscript.
Point-by-point responses
Referee: [§4] §4 (Evaluation of Verification Signal): The 87.5% agreement with human-annotated gold standards is presented as evidence of scientific soundness, but the manuscript provides no details on sample size, selection method for the sample of tasks, human inter-annotator agreement, discipline-specific breakdown (e.g., across the four fields), or error analysis of the 12.5% disagreements. Because the training trajectories are generated using these scripts as the reward signal, this information is load-bearing for attributing the reported 7.8-point gain to genuine capability improvement rather than exploitation of script artifacts.
Authors: We agree that these methodological details are essential for readers to assess the robustness of the verification signal. The original manuscript omitted them for brevity. In the revised version, we have expanded §4 with a new subsection describing the human evaluation protocol. This includes the sample size and stratified random selection method across disciplines, inter-annotator agreement metrics, a per-discipline breakdown of agreement rates, and a qualitative error analysis of the disagreements. These additions directly support that the 87.5% figure reflects genuine scientific soundness rather than exploitable artifacts, thereby strengthening the link to the observed training gains (a stratified-sampling sketch follows these responses). revision: yes
Referee: [§5] §5 (Training Experiments and Results): No ablation is reported that isolates training on D3-Gym trajectories from training on equivalent volumes of generic code-completion data or non-verifiable scientific tasks. Without such a control, the specificity of the gains to the verifiable scientific environments cannot be confirmed, weakening the central claim that D3-Gym trajectories drive the observed improvements on ScienceAgentBench.
Authors: We acknowledge that an explicit ablation would provide stronger causal evidence for the contribution of D3-Gym's verifiable scientific tasks. Performing additional large-scale training runs on matched volumes of generic code data was not feasible within our computational budget. In the revised manuscript, we have added a discussion paragraph in §5 that explains the distinction between D3-Gym (real repository-derived tasks with domain-specific data and automatic verifiers) and generic code completion. We further note that ScienceAgentBench itself emphasizes scientific reasoning and tool use, making generic coding an unlikely sole explanation for the consistent gains across model sizes. This limitation is now explicitly stated, with the ablation suggested as valuable future work. revision: partial
Referee: [§5.2] §5.2 (Benchmark Transfer): The evaluation is conducted solely on ScienceAgentBench; the manuscript does not provide evidence or discussion of whether success on the 565 D3-Gym tasks (which are repository-derived but still structured) transfers to open-ended, non-benchmark scientific discovery workflows. This assumption is central to the broader motivation but remains untested.
Authors: We agree that transfer to fully open-ended discovery workflows is an important untested aspect of the broader motivation. Our current results focus on transfer to ScienceAgentBench, which comprises real scientific tasks drawn from the literature. In the revised manuscript, we have expanded the discussion in §5.2 and added a dedicated Limitations paragraph that explicitly addresses the structured nature of D3-Gym tasks as a necessary foundation for verifiable training. We outline why this step is prerequisite to open-ended evaluation and propose concrete directions for future work, such as integration with live discovery pipelines. This revision clarifies the scope of our claims without overstatement. revision: yes
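The first response describes stratified random selection of validation tasks across disciplines. A minimal sketch of such a sampler follows, assuming proportional allocation with at least one task per discipline; the discipline labels and sample size are hypothetical.

```python
# Stratified random sampling of tasks for a human validation set:
# draw from each discipline in proportion to its share of the pool.
import random
from collections import defaultdict


def stratified_sample(tasks, n_total, seed=0):
    """tasks: list of (task_id, discipline) pairs."""
    rng = random.Random(seed)
    by_discipline = defaultdict(list)
    for task_id, discipline in tasks:
        by_discipline[discipline].append(task_id)
    sample = []
    for discipline, ids in by_discipline.items():
        # Proportional allocation, at least one task per discipline.
        # (Rounding means the total may differ slightly from n_total.)
        k = max(1, round(n_total * len(ids) / len(tasks)))
        sample.extend(rng.sample(ids, min(k, len(ids))))
    return sample


# Hypothetical pool: 80 tasks spread over four made-up disciplines.
tasks = [(f"task-{i}", d) for i, d in enumerate(
    ["bioinformatics", "chemistry", "geoscience", "neuroscience"] * 20)]
print(stratified_sample(tasks, n_total=16))
```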
Circularity Check
No significant circularity: empirical construction evaluated on external benchmark
full rationale
The paper constructs D3-Gym by sourcing 565 tasks from 239 external repositories, equips each with an executable environment and auto-synthesized evaluation script, reports 87.5% agreement with human gold standards on a sample as an empirical measurement, and measures training gains on the independent external benchmark ScienceAgentBench. No equations, derivations, or fitted parameters are present that reduce the reported performance improvements to quantities defined by the authors' own inputs or prior self-citations. No uniqueness theorems, ansatzes, or load-bearing self-citations are invoked. The central claim is a standard data-construction-plus-transfer result whose validity rests on external benchmarks and sample validation rather than self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human-annotated judgments of code correctness serve as reliable gold standards for scientific validity.