
arxiv: 2512.01089 · v2 · pith:AGD3YZH3new · submitted 2025-11-30 · 💻 cs.AI

CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents

Pith reviewed 2026-05-21 18:10 UTC · model grok-4.3

classification 💻 cs.AI
keywords automated scientific discovery · code distillation · github repositories · materials science · llm agents · code generation · domain-specific libraries

The pith

CodeDistiller automatically extracts working code examples from scientific repositories to let discovery agents generate more accurate experiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CodeDistiller as a way to turn large sets of scientific GitHub repositories into ready-to-use libraries of functional code. These libraries give automated scientific discovery agents concrete examples they can adapt instead of starting only from general knowledge or a handful of hand-written templates. On 250 materials-science repositories the system produced usable code for 74 percent of them. Agents that received the distilled library then wrote experiments rated higher in accuracy, completeness, and scientific validity than agents given only broad materials-science code snippets. The work also checks whether large language models can stand in for human experts when judging the quality of generated experiments.

Core claim

CodeDistiller processes collections of scientific GitHub repositories to extract and vet functional domain-specific code examples, enabling ASD agents augmented with these libraries to generate more accurate, complete, and scientifically sound experiments than agents relying solely on general materials-science code examples.

What carries the argument

CodeDistiller, a pipeline that combines automatic extraction with domain-expert filtering to turn raw repositories into a vetted library of working code examples.
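
The Figure 2 caption names three stages: identifying the repository's core purpose, identifying the files relevant to an example, and generating and debugging the example. A minimal sketch of how such a loop could be wired together follows. Every name here (`distill_example`, `RunResult`, the four injected callables, and the `max_debug_rounds` limit) is a hypothetical illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    ok: bool   # True if the candidate example executed successfully
    log: str   # captured output or error text, fed back to the generator


def distill_example(
    repo_files: dict[str, str],
    summarize: Callable[[dict[str, str]], str],
    select_files: Callable[[dict[str, str], str], list[str]],
    generate: Callable[[str, dict[str, str], Optional[str], Optional[str]], str],
    run: Callable[[str], RunResult],
    max_debug_rounds: int = 3,  # assumed limit, not taken from the paper
) -> Optional[str]:
    """Return a working example for one repository, or None if debugging fails."""
    # Stage 1: identify the core purpose of the repository.
    purpose = summarize(repo_files)
    # Stage 2: keep only the files relevant to building an example.
    relevant = {
        path: repo_files[path]
        for path in select_files(repo_files, purpose)
        if path in repo_files
    }
    # Stage 3: generate an example, then debug it using the execution log.
    code: Optional[str] = None
    log: Optional[str] = None
    for _ in range(max_debug_rounds + 1):
        code = generate(purpose, relevant, code, log)
        result = run(code)
        if result.ok:
            return code
        log = result.log
    return None
```

Passing the model calls and the sandboxed runner in as callables keeps the control flow testable without a language model or an execution environment.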

Load-bearing premise

The code examples pulled from the 250 repositories transfer to new experimental tasks, and any measured gains come from the library rather than from changes in prompting or scoring rules.

What would settle it

A controlled test in which the same agent prompt and evaluation rubric are used on fresh materials-science tasks, once with the CodeDistiller library and once without, followed by domain-expert scoring of the resulting experiment code.
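
One way to analyze such a paired comparison is sketched below, under stated assumptions: each task yields one expert preference (`"library"`, `"baseline"`, or `"tie"`), ties are dropped, and an exact two-sided sign test asks whether one condition is preferred more often than chance would explain. The function names and label strings are hypothetical, and the paper may use a different analysis.

```python
from math import comb


def tally_preferences(preferences: list[str]) -> tuple[int, int]:
    """Count wins for the library condition and for the baseline, ignoring ties."""
    wins = sum(p == "library" for p in preferences)
    losses = sum(p == "baseline" for p in preferences)
    return wins, losses


def sign_test_p_value(wins: int, losses: int) -> float:
    """Exact two-sided sign test under the null hypothesis P(win) = 0.5."""
    n = wins + losses
    if n == 0:
        return 1.0  # no informative comparisons
    k = max(wins, losses)
    # Upper tail P(X >= k) for X ~ Binomial(n, 0.5), doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, 10 wins against 0 losses gives p = 2/1024, roughly 0.002, while an even 5 to 5 split gives p = 1.0.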

Figures

Figures reproduced from arXiv: 2512.01089 by Peter Jansen, Pragnya Narasimha, Samiah Hassan.

Figure 1: CodeDistiller distills a large collection of GitHub repositories into a library of reusable scientific code, allowing CODE-RAG style scientific discovery agents to perform tasks beyond their parametric knowledge.
Figure 2: An overview of the core stages of the CodeDistiller workflow, including identifying the core purpose of the repository, identifying files relevant for building an example, and the example generation and debugging process.
Figure 3: Results of A/B testing, showing the proportion of times the judge model preferred the experimental output from the baseline model (with generic materials science code examples) versus the model augmented with a CodeDistiller-generated library. Values represent the average of 50 experimental tasks implemented using CodeScientist.
Original abstract

Automated Scientific Discovery (ASD) systems can help automatically generate and run code-based experiments, but their capabilities are limited by the code they can reliably generate from parametric knowledge alone. As a result, current systems either mutate a small number of manually crafted experiment examples or operate solely from parametric knowledge, limiting quality and reach. We introduce CodeDistiller, a system that automatically distills large collections of scientific GitHub repositories into a vetted library of working domain-specific code examples, allowing ASD agents to expand their capabilities without manual effort. Using a combination of automatic and domain-expert evaluation on 250 materials science repositories, we find the best model is capable of producing functional examples for 74% of repositories, while our downstream evaluation shows an ASD agent augmented with a CodeDistiller-generated library produces more accurate, complete, and scientifically sound experiments than an agent with only general materials-science code examples. We also evaluate LLM-as-a-judge ratings against domain-expert ratings in an A/B testing paradigm, finding moderate agreement and suggesting that inexpensive proxy metrics may be feasible for evaluating scientific discovery systems at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CodeDistiller, a system that automatically distills large collections of scientific GitHub repositories into a vetted library of working domain-specific code examples for Automated Scientific Discovery (ASD) agents. On 250 materials science repositories, the best model produces functional examples for 74% of repositories. Downstream evaluation shows an ASD agent augmented with the CodeDistiller-generated library produces more accurate, complete, and scientifically sound experiments than an agent using only general materials-science code examples. The work also evaluates LLM-as-a-judge ratings against domain-expert ratings in an A/B testing paradigm, finding moderate agreement.

Significance. If the attribution of performance gains holds, this approach could meaningfully expand the capabilities of ASD systems by enabling scalable, automatic augmentation with domain-specific code libraries, reducing reliance on manual example crafting. The combination of automatic filtering, expert validation, and downstream agent testing on real repositories provides a practical path forward, and the LLM-judge evaluation offers a promising direction for scalable assessment of scientific coding agents.

major comments (2)
  1. [Downstream evaluation] Downstream evaluation: The abstract reports that an ASD agent augmented with a CodeDistiller-generated library outperforms one with only general materials-science code examples in accuracy, completeness, and scientific soundness, but provides no indication that example cardinality, formatting, retrieval mechanism, or system prompt were held constant across conditions. This is load-bearing for the central claim that gains are attributable to the distilled functional examples rather than incidental differences in prompting or example count.
  2. [Evaluation on 250 materials science repositories] Repository evaluation: The claim of 74% functional examples from 250 repositories lacks details on exact filtering criteria, statistical significance testing, or controls for confounding factors in the agent comparison, as required to interpret the success rate and support the generalization assumption.
minor comments (2)
  1. [Abstract] The abstract could more explicitly define 'functional' examples and list the specific models evaluated to reach the 74% figure.
  2. [LLM-as-a-judge evaluation] Consider adding a table or figure summarizing the expert vs. LLM-judge agreement metrics (e.g., Cohen's kappa or percentage agreement) for clarity.
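
For reference, the agreement statistic named in the second minor comment can be computed as in the sketch below. It assumes each rater (domain expert or LLM judge) assigns one categorical label per item, such as the preferred condition in an A/B comparison. The function name and the example labels are illustrative and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: derived from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: four items, raters disagree on one of them.
expert = ["A", "A", "B", "B"]
judge = ["A", "B", "B", "B"]
print(cohens_kappa(expert, judge))  # prints 0.5
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which matters when one condition is preferred most of the time.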

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the clarity and rigor of our experimental claims.

point-by-point responses
  1. Referee: [Downstream evaluation] Downstream evaluation: The abstract reports that an ASD agent augmented with a CodeDistiller-generated library outperforms one with only general materials-science code examples in accuracy, completeness, and scientific soundness, but provides no indication that example cardinality, formatting, retrieval mechanism, or system prompt were held constant across conditions. This is load-bearing for the central claim that gains are attributable to the distilled functional examples rather than incidental differences in prompting or example count.

    Authors: We agree that explicit confirmation of these controls is essential to support attribution of the observed gains. In the revised manuscript we have added a dedicated paragraph in the Downstream Evaluation section that states all conditions used identical example cardinality, identical formatting of code snippets, the same retrieval mechanism (embedding-based similarity with fixed top-k selection), and the same system-prompt template, differing only in the content of the provided code library. The full prompts and retrieval parameters are now included in the appendix. These revisions directly address the concern and make the experimental design transparent. revision: yes

  2. Referee: [Evaluation on 250 materials science repositories] Repository evaluation: The claim of 74% functional examples from 250 repositories lacks details on exact filtering criteria, statistical significance testing, or controls for confounding factors in the agent comparison, as required to interpret the success rate and support the generalization assumption.

    Authors: We acknowledge that the original manuscript provided insufficient detail on these points. The revised version expands the Repository Evaluation section with the precise filtering criteria used to arrive at the 250 repositories, a clear definition of the 74% success rate (proportion of repositories yielding at least one expert-validated functional example), and bootstrap-derived 95% confidence intervals for the reported rate. Controls for the downstream agent comparison are now cross-referenced to the updated experimental-setup description. These additions improve reproducibility and allow readers to assess the strength of the generalization claim. revision: yes
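
A percentile bootstrap interval of the kind mentioned in the second response could be computed along these lines. This is a sketch under assumptions: the count of 185 successes is only what 74% of 250 implies (the true count may differ by rounding), and the resample count, seed, and function name are illustrative rather than the authors' settings.

```python
import random


def bootstrap_ci_proportion(
    successes: int,
    total: int,
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for a success rate."""
    rng = random.Random(seed)
    # One binary outcome per repository: 1 = functional example, 0 = not.
    outcomes = [1] * successes + [0] * (total - successes)
    # Resample repositories with replacement and recompute the rate each time.
    rates = sorted(
        sum(rng.choices(outcomes, k=total)) / total for _ in range(n_resamples)
    )
    lower_idx = int(n_resamples * alpha / 2)
    upper_idx = n_resamples - lower_idx - 1
    return rates[lower_idx], rates[upper_idx]


# Assumed count: 185 of 250, consistent with the reported 74%.
low, high = bootstrap_ci_proportion(185, 250)
print(f"95% CI for the success rate: [{low:.3f}, {high:.3f}]")
```

Resampling at the repository level treats each repository as one independent trial, which matches the stated definition of the rate as the proportion of repositories yielding at least one functional example.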

Circularity Check

0 steps flagged

No significant circularity; results are empirical measurements on held-out data

full rationale

The paper's claims rest on direct empirical evaluation: automatic and expert assessment of functional code extraction success across 250 repositories (yielding a 74% rate for the best model) and comparative downstream runs of an ASD agent with versus without the distilled library. These are reported as observed performance differences rather than quantities derived from fitted parameters, self-referential definitions, or predictions that reduce to the evaluation inputs by construction. No equations, ansatzes, or uniqueness theorems are invoked that would create a self-definitional loop. The comparison baseline uses general materials-science examples as an external reference, and the evaluation is described as using held-out repositories and domain-expert ratings, keeping the results independent of any internal fitting process within the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach assumes that a combination of automatic checks and limited expert review can reliably identify functional, reusable code examples across repositories; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Repositories on GitHub contain sufficient high-quality, executable scientific code that can be automatically filtered into reusable examples.
    Invoked when claiming that distillation from 250 repositories yields a useful library.

pith-pipeline@v0.9.0 · 5720 in / 1205 out tokens · 49043 ms · 2026-05-21T18:10:04.335642+00:00 · methodology


