MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration

Boris I. Yakobson; Chenmu Zhang

arXiv: 2604.02688 · v3 · pith:3NJV7XAMnew · submitted 2026-04-03 · cond-mat.mtrl-sci · cs.SE

Pith reviewed 2026-05-25 07:12 UTC · model grok-4.3

keywords LLM agent · materials science · autonomous workflows · code generation · computational materials · machine learning force fields · ferroelectric materials · guided autonomy

The pith

An LLM agent writes and runs its own Python code to complete full materials exploration workflows with only light guidance on domain rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MatClaw as a code-first agent that generates and executes Python scripts directly, pulling in any installed library to run multi-step simulations on remote clusters without pre-written tool functions. Demonstrations on CuInP2S6 cover active-learning force-field training, Curie temperature prediction, and a heuristic parameter search, and show reliable code handling over multi-day runs. The agent uses a four-layer memory system to avoid context loss and source-code retrieval to reach roughly 99 percent per-step API-call accuracy. It still misses tacit choices, such as run lengths and sampling methods, that researchers know from experience. Two lightweight additions, letting the agent read the literature and giving it a few expert rules, close the gap and produce working end-to-end results.

Core claim

MatClaw is a code-first LLM agent that writes and executes Python directly, composing installed libraries into multi-code workflows on HPC clusters; a four-layer memory architecture and retrieval over domain source code keep execution coherent and raise per-step accuracy to approximately 99 percent; three full workflows on ferroelectric CuInP2S6 succeed after literature self-learning and expert-specified constraints supply the missing tacit knowledge on timescales, equilibration, and sampling.

What carries the argument

The code-first architecture that lets the agent write and execute arbitrary Python instead of calling fixed tool functions, supported by a four-layer memory system that preserves state across long workflows.
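
To make the contrast with fixed tool functions concrete, here is a minimal sketch of a single code-first step: generated Python is written to a scratch file, run in a separate interpreter, and its output and errors are captured so they could be fed back to the model. The function name `run_generated_code` and the returned dictionary layout are hypothetical; the paper's actual execution layer, including remote HPC submission, is not described in this review.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, timeout_s: float = 60.0) -> dict:
    """Run a string of Python in a fresh interpreter and capture the outcome.

    Hypothetical helper for illustration; not MatClaw's real interface.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "step.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timeout"}
    # A non-zero return code means the script raised or exited with an error.
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}


good = run_generated_code("print(2 + 3)")
bad = run_generated_code("import not_a_real_module_xyz")
```

Because the step accepts arbitrary code rather than a named tool, any installed library is reachable without a wrapper; the cost is that failures surface as tracebacks in `stderr`, which the agent must read and act on.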

If this is right

Multi-code materials workflows no longer require manually written tool functions for each new library.
Four-layer memory keeps agent state stable over days-long runs without progressive loss.
Retrieval over source code raises reliable API use to near 99 percent per step.
Guided autonomy lets researchers supply high-level rules while the agent manages execution.
Further gains in code generation will widen the reachable scope of autonomous discovery.
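
A short calculation shows why per-step accuracy matters so much for long runs. The sketch below assumes that steps fail independently and that a single failed call is never retried, which is more pessimistic than an agent that can read an error and try again; the step counts are illustrative, since the review does not report how many API calls a workflow contains.

```python
def error_free_probability(per_step_accuracy: float, n_steps: int) -> float:
    """Probability that n independent steps all succeed without any retry."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must lie in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return per_step_accuracy ** n_steps


# Illustrative step counts only; not taken from the paper.
for n in (10, 100, 500):
    print(n, round(error_free_probability(0.99, n), 3))
```

Under these assumptions even 99 percent per-step accuracy leaves a long workflow likely to hit at least one bad call, which is one reason error recovery and persistent memory matter alongside retrieval.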

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same code-first pattern could transfer to other fields that rely on Python libraries for simulation or data analysis.
Expert constraints might be packaged as reusable templates that new users apply to fresh material systems.
The four-layer memory design could support agent tasks outside materials science that span multiple days.
If the interventions scale, the main remaining barrier becomes the quality of the underlying language model rather than workflow engineering.

Load-bearing premise

Tacit domain knowledge such as appropriate simulation lengths and sampling choices can be supplied reliably through literature reading and a few expert rules without creating new errors or needing constant oversight.

What would settle it

A workflow in which the agent, after literature self-learning and constraint input, still selects simulation parameters that produce physically invalid results not caught by the code itself.
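
One way to picture expert-specified constraints is as explicit checks applied to a simulation plan before it runs. The sketch below is an assumption about how such rules could be encoded, not the paper's mechanism: the class names, fields, and thresholds are hypothetical, with example values loosely echoing the 60 ps runs and 30 ps averaging window mentioned in the Figure 2 caption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MDPlan:
    total_ps: float       # total simulated time
    averaged_ps: float    # trailing window used for averages
    temperature_K: float


@dataclass(frozen=True)
class ExpertConstraints:
    min_total_ps: float            # hypothetical rule: shortest acceptable run
    max_averaged_fraction: float   # remainder of the run is treated as equilibration


def check_plan(plan: MDPlan, rules: ExpertConstraints) -> List[str]:
    """Return human-readable violations; an empty list means the plan passes."""
    problems = []
    if plan.temperature_K <= 0:
        problems.append("temperature must be positive")
    if plan.total_ps < rules.min_total_ps:
        problems.append(f"run of {plan.total_ps} ps is shorter than {rules.min_total_ps} ps")
    if plan.averaged_ps > rules.max_averaged_fraction * plan.total_ps:
        problems.append("averaging window leaves too little time for equilibration")
    return problems


rules = ExpertConstraints(min_total_ps=60.0, max_averaged_fraction=0.5)
ok_plan = MDPlan(total_ps=60.0, averaged_ps=30.0, temperature_K=250.0)
short_plan = MDPlan(total_ps=20.0, averaged_ps=15.0, temperature_K=250.0)
```

Checks of this kind only catch violations that someone thought to write down, which is exactly the limit the falsification test above probes.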

Figures

Figures reproduced from arXiv: 2604.02688 by Boris I. Yakobson, Chenmu Zhang.

**Figure 2.** Ferroelectric order parameter Q(T) = ⟨|η(t)|⟩ of monolayer CIPS from DeePMD MD, produced autonomously by MatClaw. Inset: side view of the CuInP2S6 monolayer structure. Open squares show the initial 60 ps sweep (last 30 ps averaged); filled circles show the final data after extending near-transition temperatures to 100 ps. The dashed line marks the estimated Tc = 261 K. Error bars are block-averaged standar…
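
The caption mentions block-averaged error bars on a time-averaged order parameter. A minimal sketch of that general technique follows, assuming equal-length contiguous blocks and treating block means as independent; the paper's block length and the definition of η are not given here, and the trajectory below is synthetic.

```python
import math
import random
import statistics
from typing import Sequence, Tuple


def block_average(samples: Sequence[float], n_blocks: int) -> Tuple[float, float]:
    """Mean and standard error estimated from the means of contiguous blocks."""
    if n_blocks < 2:
        raise ValueError("need at least two blocks to estimate an error")
    block_len = len(samples) // n_blocks
    if block_len < 1:
        raise ValueError("not enough samples for the requested number of blocks")
    # Samples beyond n_blocks * block_len are ignored.
    block_means = [
        statistics.fmean(samples[i * block_len:(i + 1) * block_len])
        for i in range(n_blocks)
    ]
    mean = statistics.fmean(block_means)
    stderr = statistics.stdev(block_means) / math.sqrt(n_blocks)
    return mean, stderr


# Synthetic eta(t) for illustration only; not data from the paper.
random.seed(0)
eta = [0.5 + random.gauss(0.0, 0.1) for _ in range(3000)]
q, q_err = block_average([abs(x) for x in eta], n_blocks=10)
```

Blocking matters because successive MD frames are correlated, so the naive standard error over all frames would understate the uncertainty.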

**Figure 3.** Agent-driven heuristic search through (E, T) parameter space. Each point represents one E-field MD simulation on a 1×25×1 CIPS supercell (500 atoms). Color indicates the domino metric (slope of ⟨|∆t(d)|⟩ vs. site separation d). Gray crosses mark conditions where fewer than 30% of Cu sites flipped. The blue-circled point (Ez = −0.16 V/Å, T = 50 K, slope = 0.32 ps/site) is the best condition found. The dotte…
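
The domino metric can be sketched as a least-squares slope. The code below rests on an assumption that the caption does not confirm: that ∆t(d) is the difference in flip times between two Cu sites separated by d sites along the chain. The function name, the `None` convention for sites that never flip, and the test wave are all hypothetical.

```python
from typing import Dict, List, Optional


def domino_slope(
    flip_times: List[Optional[float]],
    min_flipped_fraction: float = 0.3,
) -> Optional[float]:
    """Slope of mean |dt| versus site separation d, or None if too few sites flipped."""
    flipped = [(i, t) for i, t in enumerate(flip_times) if t is not None]
    if len(flipped) < 3 or len(flipped) < min_flipped_fraction * len(flip_times):
        return None
    # Group |t_j - t_i| by separation d = j - i over all flipped pairs.
    by_d: Dict[int, List[float]] = {}
    for a, (i, ti) in enumerate(flipped):
        for j, tj in flipped[a + 1:]:
            by_d.setdefault(j - i, []).append(abs(tj - ti))
    ds = sorted(by_d)
    means = [sum(by_d[d]) / len(by_d[d]) for d in ds]
    # Ordinary least-squares slope of means against d.
    d_bar = sum(ds) / len(ds)
    m_bar = sum(means) / len(means)
    num = sum((d - d_bar) * (m - m_bar) for d, m in zip(ds, means))
    den = sum((d - d_bar) ** 2 for d in ds)
    return num / den


# Synthetic wave: each site flips 0.32 ps after its neighbour (illustrative only).
wave = [0.32 * i for i in range(25)]
slope = domino_slope(wave)
sparse = domino_slope([1.0] + [None] * 24)
```

With this reading, a clean sequential flip front gives a slope equal to the time per site, while simultaneous or random switching gives a slope near zero; runs below the flip threshold are reported as `None`, mirroring the gray crosses.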

**Figure 4.** Domain wall propagation at the optimal condition (…

**Figure 5.** Chunking method comparison on pymatgen code QA (300 questions, Gemini 3.0 Flash, …
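
The truncated caption does not say which chunking methods were compared. As one illustration of structure-aware chunking for code retrieval, the sketch below splits a Python source file at top-level function and class boundaries using the standard `ast` module. It is a generic example and not the paper's pipeline; `SAMPLE` and the names inside it are made up.

```python
import ast
from typing import Dict, List


def chunk_by_definition(source: str) -> List[Dict[str, str]]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "doc": ast.get_docstring(node) or "",
                "text": ast.get_source_segment(source, node) or "",
            })
    return chunks


# Made-up source for illustration; not taken from pymatgen.
SAMPLE = '''
def lattice_volume(a, b, c):
    """Volume of an orthorhombic cell."""
    return a * b * c


class Cell:
    def scale(self, factor):
        return factor
'''

chunks = chunk_by_definition(SAMPLE)
```

Chunks cut at definition boundaries keep a signature together with its docstring and body, which is the property a retriever needs when the question is how to call a function; fixed-size text windows can split that unit apart.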

**Figure 6.** Pymatgen code QA accuracy (300 questions) across five LLMs, with and without RAG.

**Figure 7.** QA accuracy across three domain libraries (Gemini 3.0 Flash). Without RAG, accuracy …

Original abstract

Existing LLM agents for computational materials science are constrained by pipeline-bounded architectures tied to specific simulation codes and by dependence on manually written tool functions that grow with task scope. We present MatClaw, a code-first agent that writes and executes Python directly, composing any installed domain library to orchestrate multi-code workflows on remote HPC clusters without predefined tool functions. To sustain coherent execution across multi-day workflows, MatClaw uses a four-layer memory architecture that prevents progressive context loss, and retrieval-augmented generation over domain source code that raises per-step API-call accuracy to ${\sim}$99 %. Three end-to-end demonstrations on ferroelectric CuInP2S6 (machine-learning force field training via active learning, Curie temperature prediction, and heuristic parameter-space search) reveal that the agent handles code generation reliably but struggles with tacit domain knowledge. The missing knowledge, such as appropriate simulation timescales, equilibration protocols, and sampling strategies, is the kind that researchers accumulate through experience but rarely formalize. Two lightweight interventions, literature self-learning and expert-specified constraints, bridge these gaps, defining a guided autonomy model in which the researcher provides high-level domain knowledge while the agent handles workflow execution. Our results demonstrate that the gap between guided and fully autonomous computational materials research is narrower than ever before: LLMs already handle code generation and scientific interpretation reliably, and the rapid improvement in their capabilities will accelerate materials discovery beyond what manual workflows can achieve. All code and benchmarks are open-source.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MatClaw shows a code-first LLM agent can run multi-day materials workflows by writing its own Python, but the three demos all use the same material and still need expert constraints for the domain parts.

MatClaw drops the usual fixed-tool setup and has the model write and run arbitrary Python against whatever libraries are installed on the cluster. The four-layer memory plus source-code RAG gets API accuracy near 99 percent across long runs. That combination is not in the pipeline agents they cite, and the open-source release plus the three concrete workflows (active-learning MLFF, Curie temperature, heuristic search) on CuInP2S6 make the implementation checkable.

Referee Report

2 major / 1 minor

Summary. The manuscript presents MatClaw, a code-first LLM agent that directly writes and executes Python code to compose arbitrary installed domain libraries for multi-code materials workflows on remote HPC clusters, without relying on predefined tool functions. It incorporates a four-layer memory architecture to maintain coherence over multi-day runs and retrieval-augmented generation over domain source code to achieve ~99% per-step API accuracy. Three end-to-end demonstrations are reported on ferroelectric CuInP2S6: active-learning MLFF training, Curie-temperature prediction, and heuristic parameter-space search. The work concludes that the agent reliably handles code generation and interpretation but requires lightweight interventions (literature self-learning and expert-specified constraints) to address tacit domain knowledge gaps such as simulation timescales and equilibration protocols, thereby narrowing the gap between guided and fully autonomous computational materials research. All code and benchmarks are released as open source.

Significance. If the guided-autonomy model generalizes beyond the reported cases, the approach could meaningfully reduce the manual effort required for complex multi-code materials workflows and accelerate discovery. The explicit open-sourcing of code and benchmarks is a concrete strength that supports reproducibility and extension by the community. The reported ~99% API-call accuracy via RAG over source code provides a practical, measurable advance for long-horizon agent reliability in scientific computing.

major comments (2)

[Abstract] All three end-to-end demonstrations (MLFF active learning, Curie-temperature prediction, and heuristic search) are executed exclusively on the single material CuInP2S6. This leaves untested whether literature self-learning plus expert-specified constraints close tacit-knowledge gaps (simulation timescales, equilibration protocols, sampling strategies) for chemically or structurally dissimilar systems without material-specific corrections or additional oversight, which directly bears on the central claim that the guided-to-autonomous gap has narrowed.
[Abstract and demonstrations section] No quantitative breakdown is provided of intervention frequency, failure modes introduced by the constraints, or success rates across repeated independent runs, making it difficult to evaluate whether the reported workflows represent reliable guided autonomy or case-specific tuning.

minor comments (1)

[Abstract] The claim of '~99 %' per-step API-call accuracy would benefit from an explicit statement of the evaluation protocol, baseline comparison, and number of steps sampled.
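
To illustrate why the number of steps sampled matters for this comment, the sketch below computes a Wilson score interval for an observed accuracy. The counts are hypothetical and chosen only to show how the interval narrows with sample size; they are not the paper's evaluation data.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical counts: the same 99% observed accuracy at two sample sizes.
small = wilson_interval(99, 100)
large = wilson_interval(990, 1000)
```

The same headline figure supports very different conclusions depending on how many calls were scored, which is why the protocol and sample size belong next to the number.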

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, proposing targeted revisions to clarify scope and strengthen the evaluation of the guided-autonomy model.

Referee: [Abstract] All three end-to-end demonstrations (MLFF active learning, Curie-temperature prediction, and heuristic search) are executed exclusively on the single material CuInP2S6. This leaves untested whether literature self-learning plus expert-specified constraints close tacit-knowledge gaps (simulation timescales, equilibration protocols, sampling strategies) for chemically or structurally dissimilar systems without material-specific corrections or additional oversight, which directly bears on the central claim that the guided-to-autonomous gap has narrowed.

Authors: The demonstrations were deliberately focused on CuInP2S6 to enable in-depth tracing of code generation, memory usage, and tacit-knowledge interventions across multi-day workflows. The two interventions (literature self-learning and expert-specified constraints) are expressed in general terms rather than material-specific rules. We agree that explicit validation on chemically dissimilar systems would provide stronger evidence for generalizability. In revision we will (i) state the single-material scope explicitly in the abstract and (ii) add a short discussion subsection on how the intervention protocol could be applied to other systems, while noting that broader testing remains future work.
Revision: partial

Referee: [Abstract and demonstrations section] No quantitative breakdown is provided of intervention frequency, failure modes introduced by the constraints, or success rates across repeated independent runs, making it difficult to evaluate whether the reported workflows represent reliable guided autonomy or case-specific tuning.

Authors: We will add a table in the demonstrations section that enumerates every intervention made in the three workflows, their frequency, and the concrete failure modes each constraint resolved. Because the study prioritized end-to-end feasibility over statistical benchmarking, repeated independent runs with success-rate statistics were not performed. We will therefore include an explicit limitations paragraph noting the absence of such statistics and identifying repeated-run evaluation as an important next step.
Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical system demonstration without derivation chain

The paper presents MatClaw as an empirical demonstration of an LLM agent executing three workflows on CuInP2S6, with claims resting on reported execution outcomes and open-source code rather than any mathematical derivation, fitted parameters, or predictions. No equations, self-definitional constructs, fitted-input predictions, or load-bearing self-citations appear in the provided text. The central claim that lightweight interventions bridge tacit-knowledge gaps is supported by the single-material results themselves, not by reduction to prior inputs or citations. This is a standard non-circular empirical report; the derivation chain is absent by design.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an engineering demonstration of an agent system rather than a theoretical derivation, so it introduces no free parameters, no new physical axioms, and no invented entities. The central claims rest on standard assumptions about LLM code-generation capability and the utility of retrieval-augmented generation.

axioms (1)

domain assumption: LLMs can generate correct, executable Python code for scientific library calls when given appropriate context and retrieval support.
Invoked throughout the description of the code-first agent and the reported ~99% API-call accuracy.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research
cond-mat.mtrl-sci · 2026-05 · unverdicted · novelty 6.0

OpenAaaS is a hierarchical agent-as-a-service system that enables secure multi-agent collaboration for materials informatics by moving code to data rather than data to code.


[1] [1]

Agent-based learning of materials datasets from the scientific literature

Ansari, Mehrad and Moosavi, Seyed Mohamad. Agent-based learning of materials datasets from the scientific literature. Digital Discovery, 3 0 (12): 0 2607--2617, 2024. doi:10.1039/D4DD00252K

work page doi:10.1039/d4dd00252k 2024

[2] [2]

Autonomous chemical research with large language models

Boiko, Daniil A., MacKnight, Robert, Kline, Ben, and Gomes, Gabe. Autonomous chemical research with large language models. Nature, 624 0 (7992): 0 570--578, 2023. doi:10.1038/s41586-023-06792-0

work page doi:10.1038/s41586-023-06792-0 2023

[3] [3]

Augmenting large language models with chemistry tools

Bran, Andres M., Cox, Sam, Schilter, Oliver, Baldassari, Carlo, White, Andrew D., and Schwaller, Philippe. Augmenting large language models with chemistry tools. Nature Machine Intelligence, 6 0 (5): 0 525--535, 2024. doi:10.1038/s42256-024-00832-8

work page doi:10.1038/s42256-024-00832-8 2024

[4] [4]

code-chunk: Tree-sitter based semantic code chunking, 2025

code-chunk contributors . code-chunk: Tree-sitter based semantic code chunking, 2025. https://github.com/nicobailon/code-chunk

work page 2025

[5] [5]

A., and Buettcher, Stefan

Cormack, Gordon V., Clarke, Charles L. A., and Buettcher, Stefan. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proc. SIGIR, pages 758--759, 2009. doi:10.1145/1571941.1572114

work page doi:10.1145/1571941.1572114 2009

[6] [6]

Atomate2: Modular workflows for materials science, 2025

Ganose, Alex, Sahasrabuddhe, Hrushikesh, et al. Atomate2: Modular workflows for materials science, 2025. URL https://chemrxiv.org/doi/full/10.26434/chemrxiv-2025-tcr5h. Digital Discovery, 2025, 4, 1944--1973

work page doi:10.26434/chemrxiv-2025-tcr5h 2025

[7] [7]

He, R. et al. Unconventional ferroelectric domain switching dynamics in CuInP _2 S _6 from first principles. Phys. Rev. B, 108: 0 024305, 2023. doi:10.1103/PhysRevB.108.024305

work page doi:10.1103/physrevb.108.024305 2023

[8] [8]

Context rot: How increasing input tokens impacts LLM performance, 2025

Hong, Kelly, Troynikov, Anton, and Huber, Jeff. Context rot: How increasing input tokens impacts LLM performance, 2025. URL https://www.trychroma.com/research/context-rot. Chroma Research Technical Report

work page 2025

[9] [9]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez, Carlos E., Yang, John, Wettig, Alexander, Yao, Shunyu, Pei, Kexin, Press, Ofir, and Narasimhan, Karthik. SWE-bench : Can language models resolve real-world GitHub issues?, 2024. URL http://arxiv.org/abs/2310.06770

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

ACON : Optimizing context compression for long-horizon LLM agents, 2025

Kang, Minki, Chen, Wei-Ning, Han, Dongge, Inan, Huseyin A., Wutschitz, Lukas, Chen, Yanzhi, Sim, Robert, and Rajmohan, Saravan. ACON : Optimizing context compression for long-horizon LLM agents, 2025. URL https://arxiv.org/abs/2510.00615

work page internal anchor Pith review arXiv 2025

[11] [11]

The complexity trap: Simple observation masking is as efficient as LLM summarization for agent context management, 2025

Lindenbauer, Tobias, Slinko, Igor, Felder, Ludwig, Bogomolov, Egor, and Zharov, Yaroslav. The complexity trap: Simple observation masking is as efficient as LLM summarization for agent context management, 2025. URL http://arxiv.org/abs/2508.21433

work page arXiv 2025

[12] [12]

VASPilot : MCP -facilitated multi-agent intelligence for autonomous VASP simulations

Liu, Jiaxuan, Zhu, Tiannian, Ye, Caiyuan, Fang, Zhong, Weng, Hongming, and Wu, Quansheng. VASPilot : MCP -facilitated multi-agent intelligence for autonomous VASP simulations. Chinese Physics B, 34 0 (11): 0 117106, 2025 a . doi:10.1088/1674-1056/ae0681

work page doi:10.1088/1674-1056/ae0681 2025

[13] [13]

Lost in the Middle: How Language Models Use Long Contexts

Liu, Nelson F., Lin, Kevin, Hewitt, John, Paranjape, Ashwin, Bevilacqua, Michele, Petroni, Fabio, and Liang, Percy. Lost in the middle: How language models use long contexts. Transactions of the ACL, 12: 0 157--173, 2024. doi:10.1162/tacl\_a\_00638

[14] Liu, S. et al. MatTools: Benchmarking LLM tool-use for materials science, 2025b. URL http://arxiv.org/abs/2505.10852. arXiv:2505.10852

[15] Liu, Shi, Grinberg, Ilya, and Rappe, Andrew M. Intrinsic ferroelectric switching from first principles. Nature, 534(7607): 360–363, 2016. doi:10.1038/nature18286

[16] Ong, Shyue Ping, Richards, William Davidson, Jain, Anubhav, Hautier, Geoffroy, Kocher, Michael, Cholia, Shreyas, Gunter, Dan, Chevrier, Vincent L., Persson, Kristin A., and Ceder, Gerbrand. Python Materials Genomics (pymatgen): A robust, open-source Python library for materials analysis. Computational Materials Science, 68: 314–319, 2013. doi:10.1016/j.commatsci.2012.10.028

[17] Packer, Charles, Wooders, Sarah, Lin, Kevin, Fang, Vivian, Patil, Shishir G., Stoica, Ion, and Gonzalez, Joseph E. MemGPT: Towards LLMs as operating systems, 2024. URL http://arxiv.org/abs/2310.08560

[18] Paruch, Patrycja and Guyonnet, Jill. Nanoscale studies of ferroelectric domain walls as pinned elastic interfaces. Comptes Rendus Physique, 14(8): 667–684, 2013. doi:10.1016/j.crhy.2013.08.004

[19] Qiao, Bo, Li, Liqun, Zhang, Xu, He, Shilin, Kang, Yu, Zhang, Chaoyun, Yang, Fangkai, Dong, Hang, Zhang, Jue, Wang, Lu, Ma, Minghua, Zhao, Pu, Qin, Si, Qin, Xiaoting, Du, Chao, Xu, Yong, Lin, Qingwei, Rajmohan, Saravan, and Zhang, Dongmei. TaskWeaver: A code-first agent framework, 2024. URL http://arxiv.org/abs/2311.17541

[20] Rein, David, Hou, Betty Li, Stickland, Asa Cooper, Petty, Jackson, Pang, Richard Yuanzhe, Dirani, Julien, Michael, Julian, and Bowman, Samuel R. GPQA: A graduate-level Google-proof Q&A benchmark. Proc. COLM, 2024

[21] Rosen, Andrew S., Gallant, Max, George, Janine, Riebesell, Janosh, Sahasrabuddhe, Hrushikesh, Shen, Jimmy-Xuan, Wen, Mingjian, Evans, Matthew L., Petretto, Guido, Waroquiers, David, Rignanese, Gian-Marco, Persson, Kristin A., Jain, Anubhav, and Ganose, Alex M. Jobflow: Computational workflows made simple. Journal of Open Source Software, 9(93): 5995, 2024. doi:10.21105/joss.05995

[22] Shinn, Noah, Cassano, Federico, Berman, Edward, Gopinath, Ashwin, Narasimhan, Karthik, and Yao, Shunyu. Reflexion: Language agents with verbal reinforcement learning, 2023. URL http://arxiv.org/abs/2303.11366. NeurIPS 2023

[23] Sumers, Theodore R., Yao, Shunyu, Narasimhan, Karthik, and Griffiths, Thomas L. Cognitive architectures for language agents, 2024. URL http://arxiv.org/abs/2309.02427

[24] Vriza, Aikaterini, Kornu, Uma, Koneru, Aditya, Chan, Henry, and Sankaranarayanan, Subramanian K. R. S. Multi-agentic AI framework for end-to-end atomistic simulations. Digital Discovery, 5(1): 440–452, 2026. doi:10.1039/D5DD00435G

[25] Wang, Guanzhi, Xie, Yuqi, Jiang, Yunfan, Mandlekar, Ajay, Xiao, Chaowei, Zhu, Yuke, Fan, Linxi, and Anandkumar, Anima. Voyager: An open-ended embodied agent with large language models, 2023. URL http://arxiv.org/abs/2305.16291

[26] Wang, Han, Zhang, Linfeng, Han, Jiequn, and E, Weinan. DeePMD-kit: A deep learning package for many-body potential energy representation and molecular dynamics. Computer Physics Communications, 228: 178–184, 2018. doi:10.1016/j.cpc.2018.03.016

[27] Wang, Xingyao, Chen, Yangyi, Yuan, Lifan, Zhang, Yizhe, Li, Yunzhu, Peng, Hao, and Ji, Heng. Executable code actions elicit better LLM agents, 2024. URL http://arxiv.org/abs/2402.01030. ICML 2024

[28] Xia, Zeyu, Ma, Jinzhe, Zheng, Congjie, Zhang, Shufei, Li, Yuqiang, Su, Hang, Hu, P., Zhang, Changshui, Gong, Xingao, Ouyang, Wanli, Bai, Lei, Zhou, Dongzhan, and Su, Mao. An agentic framework for autonomous materials computation, 2025. URL http://arxiv.org/abs/2512.19458. arXiv:2512.19458

[29] Xiao, Guangxuan, Tian, Yuandong, Chen, Beidi, Han, Song, and Lewis, Mike. Efficient streaming language models with attention sinks. Proc. ICLR, 2024

[30] Yao, Shunyu, Zhao, Jeffrey, Yu, Dian, Du, Nan, Shafran, Izhak, Narasimhan, Karthik R., and Cao, Yuan. ReAct: Synergizing reasoning and acting in language models. In Proc. ICLR, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X

[31] Zhang, Baohua, Li, Xin, Xu, Huangchao, Jin, Zhong, Wu, Quansheng, and Li, Ce. TopoMAS: Large language model driven topological materials multiagent system, 2025a. URL http://arxiv.org/abs/2507.04053. arXiv:2507.04053

[32] Zhang, Y. et al. DP-GEN: A concurrent learning platform for the generation of reliable deep learning based potential energy models. Comput. Phys. Commun., 253: 107206, 2020

[33] Zhang, Yilin, Zhao, Xinran, Wang, Zora Zhiruo, Yang, Chenyang, Wei, Jiayi, and Wu, Tongshuang. cAST: Enhancing code retrieval-augmented generation with structural chunking via abstract syntax tree, 2025b. URL http://arxiv.org/abs/2506.15655

[34] Zheng, Zhiling, Florit, Federico, Jin, Brooke, Wu, Haoyang, Li, Shih-Cheng, Nandiwale, Kakasaheb Y., Salazar, Chase A., Mustakis, Jason G., Green, William H., and Jensen, Klavs F. Integrating machine learning and large language models to advance exploration of electrochemical reactions. Angewandte Chemie International Edition, 64(6): e202418074, 2025. doi:10.1002/anie.202418074

[35] Zou, Yunheng, Cheng, Austin H., Aldossary, Abdulrahman, Bai, Jiaru, Leong, Shi Xuan, Campos-Gonzalez-Angulo, Jorge Arturo, Choi, Changhyeok, Ser, Cher Tian, Tom, Gary, Wang, Andrew, Zhang, Zijian, Yakavets, Ilya, Hao, Han, Crebolder, Chris, Bernales, Varinia, and Aspuru-Guzik, Alán. El Agente: An autonomous agent for quantum chemistry. Matter, 8(7..., 2025. doi:10.1016/j.matt.2025.102263
