D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery
Pith reviewed 2026-05-07 07:00 UTC · model grok-4.3
The pith
D3-Gym supplies 565 verifiable environments from real scientific repositories to train language-model agents for data-driven discovery.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
D3-Gym is built by sourcing tasks from real scientific repositories and automatically equipping each one with an executable environment, input data, reference code, and a synthesized evaluation script. Rigorous checks show the scripts agree with human-annotated gold standards 87.5 percent of the time while preserving domain-specific scientific evaluation logic. Training on trajectories generated inside these environments produces consistent gains across Qwen3 model sizes on ScienceAgentBench, raising the 32B model by 7.8 points and shrinking the gap to strong closed-source models.
What carries the argument
D3-Gym, a dataset of 565 tasks, each pairing a real scientific problem with an executable environment and an automatically synthesized evaluation script that checks solution correctness against domain-specific scientific criteria.
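The paper's appendix excerpts specify a concrete contract for these scripts: a top-level eval() function taking no parameters and returning a (result, message) tuple, where result is a pass/fail boolean; the body wrapped in try/except so unexpected errors return (False, f"Error:{e}") rather than crashing; explicit file-existence checks before loading; and a __main__ block that prints the verdict. A minimal sketch of that shape follows; the file names, metric, and threshold are hypothetical stand-ins, not taken from any actual D3-Gym task.

```python
# Minimal sketch of a synthesized evaluation script following the paper's
# stated contract. The top-level function is named eval() (shadowing the
# builtin) because that is the signature the contract specifies.
import os

import pandas as pd


def eval():
    try:
        pred_path = "pred_results/predictions.csv"  # hypothetical path
        gold_path = "gold_results/predictions.csv"  # hypothetical path
        # File-existence checks: fail with a message, never crash.
        for path in (pred_path, gold_path):
            if not os.path.exists(path):
                return (False, f"Missing file: {path}")
        pred = pd.read_csv(pred_path)
        gold = pd.read_csv(gold_path)
        if pred.empty:
            return (False, "Predicted output is empty")
        # Domain-specific check (hypothetical): mean absolute error below
        # a threshold grounded in the task's scientific tolerance.
        mae = (pred["value"] - gold["value"]).abs().mean()
        if mae > 0.05:
            return (False, f"MAE {mae:.4f} exceeds threshold 0.05")
        return (True, f"PASS: MAE {mae:.4f} within tolerance")
    except Exception as e:
        # Unexpected errors yield a failure verdict instead of a crash.
        return (False, f"Error:{e}")


if __name__ == "__main__":
    ok, msg = eval()
    print(ok, msg)
```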
Load-bearing premise
That the automatically synthesized evaluation scripts correctly capture scientific soundness across the full set of tasks, and that success on these tasks transfers to genuine open-ended discovery rather than mere code completion.
What would settle it
Human scientists reviewing a large random sample of model solutions on D3-Gym tasks find agreement with the auto-generated scripts well below 87.5 percent, or models trained on D3-Gym trajectories show no improvement on a fresh collection of real laboratory or field discovery problems outside ScienceAgentBench.
Original abstract
Despite recent progress in language models and agents for scientific data-driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real-world scientific tasks. To fill this gap, we introduce D3-Gym, the first automatically constructed dataset with verifiable environments for scientific Data-Driven Discovery. D3-Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines where (2) each task is equipped with a natural language instruction, an executable environment with pre-installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3-Gym confirms that our evaluation scripts achieve 87.5% agreement with human-annotated gold standards and strong alignment in domain-specific evaluation logic, showing their scientific soundness. Further, training on trajectories sampled from D3-Gym yields consistent and substantial gains across Qwen3 models of varying sizes on ScienceAgentBench, boosting Qwen3-32B by 7.8 absolute points and substantially shrinking the gap with strong proprietary models. All D3-Gym artifacts (environments, creation workflow, trajectories, and models) can be found at https://github.com/OSU-NLP-Group/D3-Gym.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces D3-Gym, the first automatically constructed dataset of 565 verifiable environments for scientific data-driven discovery tasks, sourced from 239 real repositories across four disciplines. Each task provides a natural language instruction, executable environment with dependencies, input data and previews, reference code solution, and an automatically synthesized evaluation script. The authors report that these scripts achieve 87.5% agreement with human gold standards and strong domain-specific alignment. They further show that training on trajectories sampled from D3-Gym produces consistent gains on the external ScienceAgentBench benchmark, including a 7.8 absolute point improvement for Qwen3-32B and a reduced gap to proprietary models. All artifacts are released.
Significance. If the verification signal proves reliable across the full task set, D3-Gym would be a valuable resource for scaling training of agents on real scientific workflows rather than synthetic or non-verifiable code tasks. The empirical transfer result on ScienceAgentBench is concrete and the public release of environments, trajectories, and models supports reproducibility. The work addresses a genuine bottleneck in data-driven discovery agents, though the extent of transfer beyond the benchmark to open-ended discovery requires further evidence.
major comments (3)
- [§4] §4 (Evaluation of Verification Signal): The 87.5% agreement with human-annotated gold standards is presented as evidence of scientific soundness, but the manuscript provides no details on sample size, selection method for the sample of tasks, human inter-annotator agreement, discipline-specific breakdown (e.g., across the four fields), or error analysis of the 12.5% disagreements. Because the training trajectories are generated using these scripts as the reward signal, this information is load-bearing for attributing the reported 7.8-point gain to genuine capability improvement rather than exploitation of script artifacts. A sketch of the agreement statistics in question follows this list.
- [§5] §5 (Training Experiments and Results): No ablation is reported that isolates training on D3-Gym trajectories from training on equivalent volumes of generic code-completion data or non-verifiable scientific tasks. Without such a control, the specificity of the gains to the verifiable scientific environments cannot be confirmed, weakening the central claim that D3-Gym trajectories drive the observed improvements on ScienceAgentBench.
- [§5.2] §5.2 (Benchmark Transfer): The evaluation is conducted solely on ScienceAgentBench; the manuscript does not provide evidence or discussion of whether success on the 565 D3-Gym tasks (which are repository-derived but still structured) transfers to open-ended, non-benchmark scientific discovery workflows. This assumption is central to the broader motivation but remains untested.
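The first major comment asks for the sample size, selection method, and inter-annotator agreement behind the 87.5% figure. Below is a minimal sketch of the two statistics that would address it: raw agreement and chance-corrected agreement (Cohen's kappa), computed here between auto-script verdicts and human judgments on illustrative data. The same formula applies between two human annotators.

```python
# Raw agreement and Cohen's kappa between two verdict sequences.
# The verdict arrays below are illustrative, not the paper's data.
from collections import Counter


def agreement_and_kappa(script_verdicts, human_verdicts):
    assert len(script_verdicts) == len(human_verdicts)
    n = len(script_verdicts)
    # Raw agreement: fraction of tasks where the verdicts coincide.
    p_o = sum(s == h for s, h in zip(script_verdicts, human_verdicts)) / n
    # Chance agreement: expected coincidence under marginal frequencies.
    s_counts, h_counts = Counter(script_verdicts), Counter(human_verdicts)
    p_e = sum(s_counts[k] * h_counts[k] for k in s_counts) / n**2
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return p_o, kappa


# Illustrative: 8 sampled tasks, verdicts True = pass, False = fail.
script = [True, True, False, True, False, True, True, False]
human  = [True, True, False, True, True,  True, True, False]
p_o, kappa = agreement_and_kappa(script, human)
print(f"raw agreement = {p_o:.3f}, Cohen's kappa = {kappa:.3f}")
```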
minor comments (3)
- [Abstract and §5] The abstract and results section should explicitly state the exact baselines, number of runs, and statistical tests used to establish the 7.8-point gain and other improvements.
- [§3] Provide a table or appendix listing the distribution of the 565 tasks across the 239 repositories and four disciplines to allow assessment of coverage.
- [§5] Figures reporting model performance should include error bars or confidence intervals and clarify whether the gains are statistically significant.
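For the last minor comment, a minimal sketch of a percentile-bootstrap confidence interval over per-task benchmark outcomes; the success rate and task count below are illustrative, not the paper's numbers.

```python
# Percentile bootstrap CI for a benchmark success rate. Each entry in
# `outcomes` is a per-task result (1 = solved, 0 = unsolved).
import numpy as np

rng = np.random.default_rng(0)
outcomes = rng.binomial(1, 0.42, size=102)  # illustrative per-task results

# Resample tasks with replacement and record the mean of each resample.
boot_means = np.array([
    rng.choice(outcomes, size=outcomes.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"success rate = {outcomes.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```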
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us clarify the strengths and limitations of D3-Gym. We address each major comment point by point below, indicating where revisions have been made to the manuscript.
Point-by-point responses
Referee: [§4] §4 (Evaluation of Verification Signal): The 87.5% agreement with human-annotated gold standards is presented as evidence of scientific soundness, but the manuscript provides no details on sample size, selection method for the sample of tasks, human inter-annotator agreement, discipline-specific breakdown (e.g., across the four fields), or error analysis of the 12.5% disagreements. Because the training trajectories are generated using these scripts as the reward signal, this information is load-bearing for attributing the reported 7.8-point gain to genuine capability improvement rather than exploitation of script artifacts.
Authors: We agree that these methodological details are essential for readers to assess the robustness of the verification signal. The original manuscript omitted them for brevity. In the revised version, we have expanded §4 with a new subsection describing the human evaluation protocol. This includes the sample size and stratified random selection method across disciplines, inter-annotator agreement metrics, a per-discipline breakdown of agreement rates, and a qualitative error analysis of the disagreements. These additions directly support that the 87.5% figure reflects genuine scientific soundness rather than exploitable artifacts, thereby strengthening the link to the observed training gains (a stratified-sampling sketch follows these responses). revision: yes
Referee: [§5] §5 (Training Experiments and Results): No ablation is reported that isolates training on D3-Gym trajectories from training on equivalent volumes of generic code-completion data or non-verifiable scientific tasks. Without such a control, the specificity of the gains to the verifiable scientific environments cannot be confirmed, weakening the central claim that D3-Gym trajectories drive the observed improvements on ScienceAgentBench.
Authors: We acknowledge that an explicit ablation would provide stronger causal evidence for the contribution of D3-Gym's verifiable scientific tasks. Performing additional large-scale training runs on matched volumes of generic code data was not feasible within our computational budget. In the revised manuscript, we have added a discussion paragraph in §5 that explains the distinction between D3-Gym (real repository-derived tasks with domain-specific data and automatic verifiers) and generic code completion. We further note that ScienceAgentBench itself emphasizes scientific reasoning and tool use, making generic coding an unlikely sole explanation for the consistent gains across model sizes. This limitation is now explicitly stated, with the ablation suggested as valuable future work. revision: partial
Referee: [§5.2] §5.2 (Benchmark Transfer): The evaluation is conducted solely on ScienceAgentBench; the manuscript does not provide evidence or discussion of whether success on the 565 D3-Gym tasks (which are repository-derived but still structured) transfers to open-ended, non-benchmark scientific discovery workflows. This assumption is central to the broader motivation but remains untested.
Authors: We agree that transfer to fully open-ended discovery workflows is an important untested aspect of the broader motivation. Our current results focus on transfer to ScienceAgentBench, which comprises real scientific tasks drawn from the literature. In the revised manuscript, we have expanded the discussion in §5.2 and added a dedicated Limitations paragraph that explicitly addresses the structured nature of D3-Gym tasks as a necessary foundation for verifiable training. We outline why this step is prerequisite to open-ended evaluation and propose concrete directions for future work, such as integration with live discovery pipelines. This revision clarifies the scope of our claims without overstatement. revision: yes
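The first response describes stratified random selection of validation tasks across disciplines. A minimal sketch of such a sampler follows, assuming proportional allocation with at least one task per discipline; the discipline labels and sample size are hypothetical.

```python
# Stratified random sampling of tasks for a human validation set:
# draw from each discipline in proportion to its share of the pool.
import random
from collections import defaultdict


def stratified_sample(tasks, n_total, seed=0):
    """tasks: list of (task_id, discipline) pairs."""
    rng = random.Random(seed)
    by_discipline = defaultdict(list)
    for task_id, discipline in tasks:
        by_discipline[discipline].append(task_id)
    sample = []
    for discipline, ids in by_discipline.items():
        # Proportional allocation, at least one task per discipline.
        # (Rounding means the total may differ slightly from n_total.)
        k = max(1, round(n_total * len(ids) / len(tasks)))
        sample.extend(rng.sample(ids, min(k, len(ids))))
    return sample


# Hypothetical pool: 80 tasks spread over four made-up disciplines.
tasks = [(f"task-{i}", d) for i, d in enumerate(
    ["bioinformatics", "chemistry", "geoscience", "neuroscience"] * 20)]
print(stratified_sample(tasks, n_total=16))
```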
Circularity Check
No significant circularity: empirical construction evaluated on external benchmark
full rationale
The paper constructs D3-Gym by sourcing 565 tasks from 239 external repositories, equips each with an executable environment and auto-synthesized evaluation script, reports 87.5% agreement with human gold standards on a sample as an empirical measurement, and measures training gains on the independent external benchmark ScienceAgentBench. No equations, derivations, or fitted parameters are present that reduce the reported performance improvements to quantities defined by the authors' own inputs or prior self-citations. No uniqueness theorems, ansatzes, or load-bearing self-citations are invoked. The central claim is a standard data-construction-plus-transfer result whose validity rests on external benchmarks and sample validation rather than self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human-annotated judgments of code correctness serve as reliable gold standards for scientific validity.