CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents
Pith reviewed 2026-05-21 18:10 UTC · model grok-4.3
The pith
CodeDistiller automatically extracts working code examples from scientific repositories to let discovery agents generate more accurate experiments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodeDistiller processes collections of scientific GitHub repositories to extract and vet functional, domain-specific code examples. Automated Scientific Discovery (ASD) agents augmented with the resulting library generate more accurate, complete, and scientifically sound experiments than agents that rely solely on general materials-science code examples.
What carries the argument
CodeDistiller, a pipeline that combines automatic extraction with domain-expert filtering to turn raw repositories into a vetted library of working code examples.
Load-bearing premise
The code examples pulled from the 250 repositories will apply to new experimental tasks, and any measured gains come from the library rather than from changes in prompting or scoring rules.
What would settle it
A controlled test in which the same agent prompt and evaluation rubric are used on fresh materials-science tasks, once with the CodeDistiller library and once without, followed by domain-expert scoring of the resulting experiment code.
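A minimal sketch of how the outcome of such a paired test could be summarized, assuming each task yields one expert preference between the with-library and without-library outputs and that ties are dropped beforehand. The function name and the tallies are hypothetical, and the exact sign test is a standard choice used here for illustration, not a method the paper is known to report.

```python
from math import comb


def sign_test_p_value(wins_with_library: int, wins_without: int) -> float:
    """Two-sided exact sign test on paired preferences (ties already dropped)."""
    n = wins_with_library + wins_without
    if n == 0:
        return 1.0  # no informative pairs, so no evidence either way
    k = min(wins_with_library, wins_without)
    # Probability of a split at least this lopsided if each side were
    # equally likely to be preferred on every task.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical tallies for illustration only, not results from the paper.
p = sign_test_p_value(wins_with_library=9, wins_without=1)
print(f"two-sided p-value: {p:.4f}")
```

A small p-value would only indicate that experts preferred one condition more often than chance predicts; attributing that preference to the library still depends on the prompt and rubric being identical across conditions.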
Original abstract
Automated Scientific Discovery (ASD) systems can help automatically generate and run code-based experiments, but their capabilities are limited by the code they can reliably generate from parametric knowledge alone. As a result, current systems either mutate a small number of manually-crafted experiment examples, or operate solely from parametric knowledge, limiting quality and reach. We introduce CodeDistiller, a system that automatically distills large collections of scientific Github repositories into a vetted library of working domain-specific code examples, allowing ASD agents to expand their capabilities without manual effort. Using a combination of automatic and domain-expert evaluation on 250 materials science repositories, we find the best model is capable of producing functional examples for 74% of repositories, while our downstream evaluation shows an ASD agent augmented with a CodeDistiller generated library produces more accurate, complete, and scientifically sound experiments than an agent with only general materials-science code examples. We also evaluate LLM-as-a-judge ratings against domain-expert ratings in an A/B testing paradigm, finding moderate agreement and suggesting that inexpensive proxy metrics may be feasible for evaluating scientific discovery systems at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CodeDistiller, a system that automatically distills large collections of scientific GitHub repositories into a vetted library of working domain-specific code examples for Automated Scientific Discovery (ASD) agents. On 250 materials science repositories, the best model produces functional examples for 74% of repositories. Downstream evaluation shows an ASD agent augmented with the CodeDistiller-generated library produces more accurate, complete, and scientifically sound experiments than an agent using only general materials-science code examples. The work also evaluates LLM-as-a-judge ratings against domain-expert ratings in an A/B testing paradigm, finding moderate agreement.
Significance. If the attribution of performance gains holds, this approach could meaningfully expand the capabilities of ASD systems by enabling scalable, automatic augmentation with domain-specific code libraries, reducing reliance on manual example crafting. The combination of automatic filtering, expert validation, and downstream agent testing on real repositories provides a practical path forward, and the LLM-judge evaluation offers a promising direction for scalable assessment of scientific coding agents.
major comments (2)
- [Downstream evaluation] Downstream evaluation: The abstract reports that an ASD agent augmented with a CodeDistiller-generated library outperforms one with only general materials-science code examples in accuracy, completeness, and scientific soundness, but provides no indication that example cardinality, formatting, retrieval mechanism, or system prompt were held constant across conditions. This is load-bearing for the central claim that gains are attributable to the distilled functional examples rather than incidental differences in prompting or example count.
- [Evaluation on 250 materials science repositories] Repository evaluation: The claim of 74% functional examples from 250 repositories lacks details on exact filtering criteria, statistical significance testing, or controls for confounding factors in the agent comparison, as required to interpret the success rate and support the generalization assumption.
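One way to attach uncertainty to a success rate like the one questioned above is a percentile bootstrap over per-repository outcomes. The sketch below assumes the 74% figure corresponds to 185 successes out of 250 repositories (the paper may round differently); the function name, resample count, and seed are illustrative choices rather than anything taken from the paper.

```python
import random


def bootstrap_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a success rate.

    outcomes: list of 0/1 flags, one per repository (1 = functional example).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample repositories with replacement and recompute the rate each time.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# ASSUMPTION: 185 of 250 repositories succeeded (74%); not confirmed by the source.
outcomes = [1] * 185 + [0] * 65
low, high = bootstrap_rate_ci(outcomes)
print(f"observed rate: {sum(outcomes) / len(outcomes):.2f}")
print(f"95% bootstrap interval: [{low:.3f}, {high:.3f}]")
```

An interval of this kind describes sampling variability over repositories only; it says nothing about whether the 250 repositories are representative of materials-science code in general.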
minor comments (2)
- [Abstract] The abstract could more explicitly define 'functional' examples and list the specific models evaluated to reach the 74% figure.
- [LLM-as-a-judge evaluation] Consider adding a table or figure summarizing the expert vs. LLM-judge agreement metrics (e.g., Cohen's kappa or percentage agreement) for clarity.
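For reference, the two agreement measures named in the last comment can be computed as in the sketch below. It assumes each A/B comparison receives one categorical label from the domain expert and one from the LLM judge; the function name and the example labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical A/B preferences for eight comparisons, for illustration only.
expert = ["A", "A", "B", "B", "A", "B", "A", "A"]
judge = ["A", "B", "B", "B", "A", "A", "A", "A"]

agreement = sum(e == j for e, j in zip(expert, judge)) / len(expert)
print(f"percentage agreement: {agreement:.2%}")
print(f"Cohen's kappa: {cohens_kappa(expert, judge):.3f}")
```

Reporting kappa alongside raw agreement matters because raw agreement can look high when one option is preferred most of the time by both raters.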
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the clarity and rigor of our experimental claims.
Point-by-point responses
- Referee: [Downstream evaluation] Downstream evaluation: The abstract reports that an ASD agent augmented with a CodeDistiller-generated library outperforms one with only general materials-science code examples in accuracy, completeness, and scientific soundness, but provides no indication that example cardinality, formatting, retrieval mechanism, or system prompt were held constant across conditions. This is load-bearing for the central claim that gains are attributable to the distilled functional examples rather than incidental differences in prompting or example count.
  Authors: We agree that explicit confirmation of these controls is essential to support attribution of the observed gains. In the revised manuscript we have added a dedicated paragraph in the Downstream Evaluation section that states all conditions used identical example cardinality, identical formatting of code snippets, the same retrieval mechanism (embedding-based similarity with fixed top-k selection), and the same system-prompt template, differing only in the content of the provided code library. The full prompts and retrieval parameters are now included in the appendix. These revisions directly address the concern and make the experimental design transparent.
  Revision: yes
- Referee: [Evaluation on 250 materials science repositories] Repository evaluation: The claim of 74% functional examples from 250 repositories lacks details on exact filtering criteria, statistical significance testing, or controls for confounding factors in the agent comparison, as required to interpret the success rate and support the generalization assumption.
  Authors: We acknowledge that the original manuscript provided insufficient detail on these points. The revised version expands the Repository Evaluation section with the precise filtering criteria used to arrive at the 250 repositories, a clear definition of the 74% success rate (proportion of repositories yielding at least one expert-validated functional example), and bootstrap-derived 95% confidence intervals for the reported rate. Controls for the downstream agent comparison are now cross-referenced to the updated experimental-setup description. These additions improve reproducibility and allow readers to assess the strength of the generalization claim.
  Revision: yes
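The controls described in the first response (embedding-based similarity with fixed top-k selection and a shared prompt template, with only the library content varying) can be illustrated with a short sketch. Everything named here is hypothetical: the template text, the value of TOP_K, the function names, and the hand-written toy vectors that stand in for real embeddings. It is not the authors' implementation.

```python
import math

# ASSUMPTIONS: the template wording and k are illustrative placeholders.
PROMPT_TEMPLATE = "Task: {task}\n\nReference examples:\n{examples}"
TOP_K = 2


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, library, k):
    """Return the k snippets whose embeddings are most similar to the query.

    library: list of (snippet_text, embedding) pairs.
    """
    ranked = sorted(library, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [snippet for snippet, _ in ranked[:k]]


def build_prompt(task, query_vec, library):
    """Same template and same k for every condition; only `library` differs."""
    examples = "\n---\n".join(retrieve_top_k(query_vec, library, TOP_K))
    return PROMPT_TEMPLATE.format(task=task, examples=examples)


# Toy libraries with made-up 2-d "embeddings", for illustration only.
general_library = [("general example 1", [0.2, 0.8]), ("general example 2", [0.5, 0.5]),
                   ("general example 3", [0.9, 0.1])]
distilled_library = [("distilled example 1", [1.0, 0.0]), ("distilled example 2", [0.0, 1.0]),
                     ("distilled example 3", [0.9, 0.1])]

query = [1.0, 0.0]
print(build_prompt("hypothetical task", query, general_library))
print(build_prompt("hypothetical task", query, distilled_library))
```

Routing both conditions through one `build_prompt` function is one simple way to make it hard for example count or formatting to drift between conditions.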
Circularity Check
No significant circularity; results are empirical measurements on held-out data
Full rationale
The paper's claims rest on direct empirical evaluation: automatic and expert assessment of functional code extraction success across 250 repositories (yielding a 74% rate for the best model) and comparative downstream runs of an ASD agent with versus without the distilled library. These are reported as observed performance differences rather than quantities derived from fitted parameters, self-referential definitions, or predictions that reduce to the evaluation inputs by construction. No equations, ansatzes, or uniqueness theorems are invoked that would create a self-definitional loop. The comparison baseline uses general materials-science examples as an external reference, and the evaluation is described as using held-out repositories and domain-expert ratings, keeping the results independent of any internal fitting process within the reported experiments.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Repositories on GitHub contain sufficient high-quality, executable scientific code that can be automatically filtered into reusable examples.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: We introduce CODEDISTILLER, a system that automatically distills large collections of scientific Github repositories into a vetted library of working domain-specific code examples
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: an ASD agent augmented with a CodeDistiller-generated library produces more accurate, complete, and scientifically sound experiments
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.