El Agente Forjador: Task-Driven Agent Generation for Quantum Simulation
Pith reviewed 2026-05-10 11:03 UTC · model grok-4.3
The pith
LLM agents can autonomously generate and reuse tools to solve quantum simulation tasks more accurately than they do by direct problem-solving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM coding agents can autonomously forge, validate, and reuse computational tools through a four-stage workflow of tool analysis, tool generation, task execution, and iterative solution evaluation, producing higher accuracy on quantum simulation tasks than baseline direct solving and enabling cost-effective reuse across agent strengths and domains.
What carries the argument
The four-stage workflow of tool analysis, tool generation, task execution, and iterative solution evaluation that lets agents create and share task-specific computational tools on demand.
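The four stages can be pictured as a small control loop. The sketch below is only an illustration under stated assumptions: `ToolRegistry`, `analyze`, `generate`, `execute`, `evaluate`, and `solve` are hypothetical names, and the "tools" are toy arithmetic functions standing in for code an LLM agent would write. It is not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names here are hypothetical stand-ins, not the paper's API.
Tool = Callable[[float], float]


@dataclass
class ToolRegistry:
    """Shared store so later tasks can reuse tools forged earlier."""
    tools: Dict[str, Tool] = field(default_factory=dict)
    forged: List[str] = field(default_factory=list)  # log of generation events


def analyze(task: str, registry: ToolRegistry) -> List[str]:
    """Stage 1: list the tools the task needs that the registry lacks."""
    needed = [task]  # toy rule: one tool per task name
    return [name for name in needed if name not in registry.tools]


def generate(name: str, registry: ToolRegistry) -> None:
    """Stage 2: forge a tool. A real agent would write and test code here."""
    catalog: Dict[str, Tool] = {"square": lambda x: x * x, "double": lambda x: 2 * x}
    registry.tools[name] = catalog[name]
    registry.forged.append(name)


def execute(task: str, x: float, registry: ToolRegistry) -> float:
    """Stage 3: run the task with the tools now available."""
    return registry.tools[task](x)


def evaluate(result: float, expected: float, tol: float = 1e-9) -> bool:
    """Stage 4: accept or reject the solution (toy check against a known answer)."""
    return abs(result - expected) <= tol


def solve(task: str, x: float, expected: float,
          registry: ToolRegistry, max_rounds: int = 3) -> float:
    """Iterate the four stages until the evaluator accepts or rounds run out."""
    for _ in range(max_rounds):
        for name in analyze(task, registry):
            generate(name, registry)
        result = execute(task, x, registry)
        if evaluate(result, expected):
            return result
    raise RuntimeError(f"no accepted solution for {task!r}")


registry = ToolRegistry()
solve("square", 3.0, 9.0, registry)   # forges "square"
solve("square", 4.0, 16.0, registry)  # reuses it, nothing new is forged
print(registry.forged)
```

The second call finds the tool already in the registry, which is the reuse behavior the summary describes. In a real system a rejected evaluation would feed back into regeneration rather than simply repeating the same step.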
If this is right
- Reusing a toolset built by a stronger agent can reduce API cost and raise solution quality for weaker agents.
- Tools forged for different domains can be combined to solve hybrid quantum tasks.
- Both zero-shot tool generation per task and reuse of a curriculum-built toolset improve accuracy consistently over direct baseline solving.
- Agent capabilities become defined by the tasks they can solve rather than by pre-engineered tool implementations.
Where Pith is reading between the lines
- The same workflow could let agents adapt to new scientific libraries without repeated human curation.
- Shared tool libraries might accumulate improvements across many agents and sessions.
- Similar tool-forging could apply outside quantum science to fields that rely on evolving code libraries.
- Automated checks for tool correctness on edge cases would be needed before trusting the outputs in research.
Load-bearing premise
The agents can reliably generate and validate scientifically correct tools without introducing subtle errors that only appear on harder or unseen quantum problems.
What would settle it
A demonstration that the generated tools produce wrong answers on a new set of complex quantum problems where standard numerical solvers give correct results would falsify the reliability of autonomous tool creation.
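One way to picture such a test is a cross-check of a generated tool against an independent numerical solver. The sketch below is a toy under stated assumptions: `forged_ground_energy` is a hypothetical stand-in for an agent-written tool (here a closed-form two-level result), and the reference is plain NumPy diagonalization. The case list and tolerance are illustrative, not taken from the paper.

```python
import numpy as np


def forged_ground_energy(eps: float, delta: float) -> float:
    """Stand-in for an agent-written tool: closed-form ground-state energy
    of the two-level Hamiltonian H = [[eps, delta], [delta, -eps]]."""
    return -float(np.hypot(eps, delta))


def reference_ground_energy(eps: float, delta: float) -> float:
    """Independent reference: dense diagonalization with NumPy."""
    h = np.array([[eps, delta], [delta, -eps]], dtype=float)
    return float(np.linalg.eigvalsh(h)[0])  # eigenvalues come back in ascending order


def cross_check(cases, tol=1e-9):
    """Return the cases where tool and reference disagree by more than tol."""
    return [
        (eps, delta)
        for eps, delta in cases
        if abs(forged_ground_energy(eps, delta) - reference_ground_energy(eps, delta)) > tol
    ]


# Include edge cases: degenerate spectrum, zero coupling, large scale separation.
cases = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1e-8), (1e3, 1.0), (-2.0, 0.5)]
failures = cross_check(cases)
print("disagreements:", failures)
```

A non-empty `failures` list on problems where the reference solver is known to be correct is the kind of evidence the paragraph above describes.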
Original abstract
AI for science promises to accelerate the discovery process. The advent of large language models (LLMs) and agentic workflows enables the expediting of a growing range of scientific tasks. However, most of the current generation of agentic systems depend on static, hand-curated toolsets that hinder adaptation to new domains and evolving libraries. We present El Agente Forjador, a multi-agent framework in which universal coding agents autonomously forge, validate, and reuse computational tools through a four-stage workflow of tool analysis, tool generation, task execution, and iterative solution evaluation. Evaluated across 24 tasks spanning quantum chemistry and quantum dynamics on five coding agent setups, we compare three operating modes: zero-shot generation of tools per task, reuse of a curriculum-built toolset, and direct problem-solving with the coding agents as the baseline. We find that our tool generation and reuse framework consistently improves accuracy over the baseline. We also show that reusing a toolset built by a stronger coding agent can reduce API cost and substantially raises the solution quality for weaker coding agents. Case studies further demonstrate that tools forged for different domains can be combined to solve hybrid tasks. Taken together, these results show that LLM-based agents can use their scientific knowledge and coding capabilities to autonomously build reusable scientific tools, pointing toward a paradigm in which agent capabilities are defined by the tasks they are designed to solve rather than by explicitly engineered implementations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces El Agente Forjador, a multi-agent framework in which LLM coding agents autonomously perform a four-stage workflow (tool analysis, generation, task execution, and iterative evaluation) to forge, validate, and reuse computational tools for quantum simulation. It evaluates three operating modes—zero-shot per-task tool generation, reuse of a curriculum-built toolset, and direct baseline problem-solving—across 24 tasks in quantum chemistry and quantum dynamics using five coding-agent setups. The central claims are that the tool-generation-and-reuse framework yields consistent accuracy gains over baseline, that toolsets forged by stronger agents reduce API cost and improve solution quality for weaker agents, and that cross-domain tools can be combined to solve hybrid tasks.
Significance. If the reported gains are supported by detailed quantitative metrics and rigorous validation of scientific correctness, the work would meaningfully advance agentic AI for science by showing that agents can dynamically construct and share reusable scientific tooling rather than depending on static hand-curated libraries. The cross-domain tool-combination case studies would be especially valuable for demonstrating modular, composable capabilities in multi-physics settings. The empirical nature of the study (no free parameters or circular derivations) is a strength, but the absence of concrete performance numbers and validation details currently limits the strength of the conclusions.
major comments (3)
- [Abstract] Abstract: the claims of 'consistent accuracy improvements' and 'substantially raises the solution quality' are presented without any quantitative metrics, error bars, statistical tests, or description of how accuracy was measured (e.g., against analytical solutions, reference implementations, or conservation laws). This information is load-bearing for the central empirical comparison.
- [§3 (workflow description)] Four-stage workflow (analysis, generation, execution, evaluation): the validation step is described only at high level. If it relies primarily on task completion or basic unit tests rather than cross-checks against analytical solutions, reference codes, or physical invariants (e.g., operator ordering in Trotterization or basis-set correctness), subtle scientific errors could remain undetected and would undermine the transferability claims for hybrid tasks.
- [§4 or §5 (experimental evaluation)] Evaluation section (24 tasks): no justification is supplied for why the chosen tasks adequately sample the space of real-world quantum simulation challenges, nor are the concrete accuracy metrics or success criteria for chemistry versus dynamics tasks specified. This makes it impossible to judge whether the reported gains generalize beyond the selected set.
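To make the second comment concrete, the following sketch shows the kind of invariant check a validation stage could run on a Trotterized propagator: unitarity, and the expected first-order convergence toward the exact propagator. It assumes a single-qubit Hamiltonian H = X + Z chosen purely for illustration. The helper names `expm_herm`, `trotter`, and `trotter_error` are hypothetical and do not come from the manuscript.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def expm_herm(h: np.ndarray, t: float) -> np.ndarray:
    """exp(-i h t) for Hermitian h, computed via eigendecomposition."""
    w, v = np.linalg.eigh(h)
    return (v * np.exp(-1j * w * t)) @ v.conj().T


def trotter(a: np.ndarray, b: np.ndarray, t: float, n: int) -> np.ndarray:
    """First-order Trotter product (e^{-i a t/n} e^{-i b t/n})^n."""
    step = expm_herm(a, t / n) @ expm_herm(b, t / n)
    return np.linalg.matrix_power(step, n)


def trotter_error(a: np.ndarray, b: np.ndarray, t: float, n: int) -> float:
    """Spectral-norm distance between the Trotter and exact propagators."""
    return float(np.linalg.norm(trotter(a, b, t, n) - expm_herm(a + b, t), ord=2))


u = trotter(X, Z, 1.0, 20)
is_unitary = np.allclose(u.conj().T @ u, np.eye(2))

# For a first-order formula, doubling the step count should roughly halve the error.
e20 = trotter_error(X, Z, 1.0, 20)
e40 = trotter_error(X, Z, 1.0, 40)
print(is_unitary, e40 / e20)
```

A tool that passes unit tests but shows the wrong convergence rate or breaks unitarity would be flagged by a check of this kind, which is the gap the comment is pointing at.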
minor comments (2)
- [Evaluation] The manuscript would benefit from a table summarizing the five coding-agent setups, the three operating modes, and the quantitative outcomes (accuracy, cost, quality) for each combination.
- [§3] Notation for the four-stage workflow and the 'curriculum-built toolset' should be introduced once with a clear diagram or pseudocode to improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. These have helped us identify opportunities to strengthen the clarity of our empirical claims, the description of our validation procedures, and the justification of our experimental design. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
- Referee: [Abstract] Abstract: the claims of 'consistent accuracy improvements' and 'substantially raises the solution quality' are presented without any quantitative metrics, error bars, statistical tests, or description of how accuracy was measured (e.g., against analytical solutions, reference implementations, or conservation laws). This information is load-bearing for the central empirical comparison.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. The body of the manuscript already contains the full set of accuracy metrics (including means, standard deviations, and comparisons to analytical solutions and reference implementations), error bars, and statistical comparisons across the three operating modes. In the revised abstract we will add representative quantitative findings and a concise description of the evaluation methodology.
  Revision: yes
- Referee: [§3 (workflow description)] Four-stage workflow (analysis, generation, execution, evaluation): the validation step is described only at high level. If it relies primarily on task completion or basic unit tests rather than cross-checks against analytical solutions, reference codes, or physical invariants (e.g., operator ordering in Trotterization or basis-set correctness), subtle scientific errors could remain undetected and would undermine the transferability claims for hybrid tasks.
  Authors: The validation stage combines execution success checks with scientific validation steps that include comparisons to analytical solutions (where available), verification of physical invariants such as energy conservation and operator ordering, and cross-checks against reference implementations. We will revise §3 to provide an explicit, expanded description of these validation procedures so that the robustness of the forged tools is transparent.
  Revision: yes
- Referee: [§4 or §5 (experimental evaluation)] Evaluation section (24 tasks): no justification is supplied for why the chosen tasks adequately sample the space of real-world quantum simulation challenges, nor are the concrete accuracy metrics or success criteria for chemistry versus dynamics tasks specified. This makes it impossible to judge whether the reported gains generalize beyond the selected set.
  Authors: The 24 tasks were chosen to span representative problems in quantum chemistry and quantum dynamics drawn from standard benchmarks in the literature. We will add a dedicated paragraph in the experimental evaluation section that justifies the task selection on the basis of their coverage of core simulation challenges and that explicitly states the accuracy metrics and success criteria applied to each domain.
  Revision: yes
Circularity Check
No significant circularity in empirical agent framework evaluation
The paper is an empirical comparison of three agent operating modes (zero-shot tool generation, curriculum-built toolset reuse, and direct problem-solving baseline) across 24 quantum chemistry and dynamics tasks. It reports accuracy gains and cost reductions from tool reuse without any mathematical derivation chain, fitted parameters, self-definitional constructs, or load-bearing self-citations that reduce claims to inputs by construction. The central results rest on task-completion metrics and case studies rather than equations or uniqueness theorems that could create circularity, making the evaluation self-contained against external benchmarks.