SciML Agents: Write the Solver, Not the Solution

Amir Gholami; Dmitriy Morozov; Haocheng Xi; Kurt Keutzer; Michael W. Mahoney; Rishabh Tiwari; Saarth Gaonkar; Xiang Zheng

arXiv: 2509.09936 · v1 · submitted 2025-09-12 · cs.LG · cs.NA · math.NA

Saarth Gaonkar, Xiang Zheng, Haocheng Xi, Rishabh Tiwari, Kurt Keutzer, Dmitriy Morozov, Michael W. Mahoney, Amir Gholami

Pith reviewed 2026-05-18 17:54 UTC · model grok-4.3

classification: cs.LG · cs.NA · math.NA

keywords: scientific machine learning, ODE solvers, large language models, code generation, numerical methods, agent systems, benchmarking, stiff equations

The pith

LLMs can generate executable code that selects and applies appropriate numerical solvers for ordinary differential equations given natural language descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can function as agents that write code to solve ODE problems by choosing suitable numerical methods instead of directly approximating solutions with neural networks. This matters because decades of established algorithms already handle stiffness, stability, and accuracy, so the task reduces to making correct domain-aware choices in code. The authors introduce a diagnostic dataset of misleading problems that require algebraic insight to classify as non-stiff and a benchmark of one thousand diverse ODE tasks spanning stiff and non-stiff regimes. They test open- and closed-source models under unguided and guided prompting conditions, measuring whether the output code runs and matches reference numerical results. With domain-specific guidance, newer instruction-following models reach high accuracy on both executability and validity, indicating that careful prompting can produce reliable SciML agents for these tasks.
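To make the stiff versus non-stiff distinction concrete, here is a minimal sketch (not taken from the paper) that applies explicit and implicit Euler to the linear test problem dy/dt = lam * y. The values lam = -1000 and h = 0.01, and both function names, are illustrative assumptions. With these values the explicit update multiplies the state by (1 + h*lam) = -9 at every step and diverges, while the implicit update divides by (1 - h*lam) = 11 and decays, which is the kind of behavior that motivates choosing an implicit solver for stiff problems.

```python
def explicit_euler(lam, y0, h, n_steps):
    """Forward Euler for dy/dt = lam * y: y_{n+1} = (1 + h*lam) * y_n."""
    y = y0
    for _ in range(n_steps):
        y = y + h * lam * y
    return y


def implicit_euler(lam, y0, h, n_steps):
    """Backward Euler for dy/dt = lam * y: y_{n+1} = y_n / (1 - h*lam)."""
    y = y0
    for _ in range(n_steps):
        y = y / (1.0 - h * lam)
    return y


# Illustrative values only; they are not taken from the benchmark.
lam, y0, h, n_steps = -1000.0, 1.0, 0.01, 10
y_explicit = explicit_euler(lam, y0, h, n_steps)
y_implicit = implicit_euler(lam, y0, h, n_steps)
print(f"explicit Euler after {n_steps} steps: {y_explicit:.3e}")
print(f"implicit Euler after {n_steps} steps: {y_implicit:.3e}")
```

For this test problem explicit Euler is stable only when |1 + h*lam| <= 1, so lam = -1000 would require h <= 0.002 regardless of how smooth the solution is.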

Core claim

Given a natural-language description of an ODE, LLMs can produce runnable code that selects a scientifically appropriate solver, enforces stability checks, and yields numerically valid results when evaluated against reference solutions on both a diagnostic set of superficially misleading problems and a 1,000-task benchmark covering stiff and non-stiff regimes.

What carries the argument

Guided prompting that supplies domain knowledge about stiffness classification and solver selection, enabling the LLM to translate problem descriptions into executable numerical code.

If this is right

LLMs can distinguish superficial indicators of stiffness from actual mathematical requirements through algebraic simplification when prompted.
The burden in scientific machine learning shifts from learning solution functions to selecting and configuring existing numerical algorithms.
Newer instruction-following models achieve strong performance on executability and validity without additional fine-tuning when given sufficient context.
The introduced diagnostic and large-scale benchmarks provide concrete measures for progress on LLM capabilities in scientific code generation.
Fine-tuning remains useful for older or smaller models while recent systems often succeed off-the-shelf with guidance.
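The first point above can be illustrated with the two trigonometric diagnostic examples quoted in the appendix excerpts on this page, whose y-coefficients contain large constants but cancel identically. The sketch below is a numerical spot check under stated assumptions: the helper name `max_abs_coefficient`, the sampling interval [0, 10], and the sample count are hypothetical, and sampling is only a heuristic. A symbolic simplification would be the rigorous way to show the cancellation.

```python
import math


def max_abs_coefficient(p, t_start, t_end, n_samples=201):
    """Largest |p(t)| over evenly spaced sample points in [t_start, t_end]."""
    step = (t_end - t_start) / (n_samples - 1)
    return max(abs(p(t_start + i * step)) for i in range(n_samples))


# Coefficients of y copied from the two trigonometric diagnostic examples.
def p_first(t):
    return 20000 * (math.sin(t) ** 2 + math.cos(t) ** 2) - 20000


def p_second(t):
    return 50000 * (math.cos(2 * t) - 2 * math.cos(t) ** 2 + 1)


# The interval [0, 10] is an assumption; these excerpts do not state a domain.
for name, p in [("20000 example", p_first), ("50000 example", p_second)]:
    print(name, max_abs_coefficient(p, 0.0, 10.0))
```

Both maxima come out at the level of floating-point round-off, so the effective equations are dy/dt = cos(t) and dy/dt = sin(2*t), neither of which depends on the large constant.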

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompting strategy could extend to generating code for partial differential equations or other simulation tasks that rely on mature numerical libraries.
Automated verification steps could be added to the generated code to catch solver mismatches before execution.
This code-writing approach may reduce the need to train specialized neural approximators for routine scientific problems where standard solvers already exist.

Load-bearing premise

The reference solutions used to judge numerical validity are treated as ground truth without adjustments for solver tolerances or post-hoc selection that could change reported accuracy.

What would settle it

A collection of standard ODE problems on which guided LLM-generated code consistently returns results that differ beyond numerical tolerance from outputs of established library solvers on identical inputs.
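One way to make "differ beyond numerical tolerance" operational is a mixed absolute and relative check on sampled solution values. The sketch below is an assumption about how such a check could look, not the paper's metric: the function name `within_tolerance`, the default `rtol` and `atol`, and the sample values are all placeholders.

```python
def within_tolerance(candidate, reference, rtol=1e-3, atol=1e-6):
    """True if every candidate value matches the reference value.

    Uses the mixed criterion |c - r| <= atol + rtol * |r|.
    The default rtol and atol are placeholders, not the paper's thresholds.
    """
    if len(candidate) != len(reference):
        return False
    return all(abs(c - r) <= atol + rtol * abs(r)
               for c, r in zip(candidate, reference))


reference = [1.0, 0.5, 0.25]        # made-up values for illustration
close_run = [1.0004, 0.5001, 0.25]  # small deviations, should pass
far_run = [1.0, 0.5, 0.30]          # last value is off by 0.05, should fail
print(within_tolerance(close_run, reference))
print(within_tolerance(far_run, reference))
```

The relative term keeps the check meaningful for large solution values, and the absolute term prevents spurious failures where the reference passes near zero.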

Figures

Figures reproduced from arXiv: 2509.09936 by Amir Gholami, Dmitriy Morozov, Haocheng Xi, Kurt Keutzer, Michael W. Mahoney, Rishabh Tiwari, Saarth Gaonkar, Xiang Zheng.

Original abstract

Recent work in scientific machine learning aims to tackle scientific tasks directly by predicting target values with neural networks (e.g., physics-informed neural networks, neural ODEs, neural operators, etc.), but attaining high accuracy and robustness has been challenging. We explore an alternative view: use LLMs to write code that leverages decades of numerical algorithms. This shifts the burden from learning a solution function to making domain-aware numerical choices. We ask whether LLMs can act as SciML agents that, given a natural-language ODE description, generate runnable code that is scientifically appropriate, selecting suitable solvers (stiff vs. non-stiff), and enforcing stability checks. There is currently no benchmark to measure this kind of capability for scientific computing tasks. As such, we first introduce two new datasets: a diagnostic dataset of adversarial "misleading" problems; and a large-scale benchmark of 1,000 diverse ODE tasks. The diagnostic set contains problems whose superficial appearance suggests stiffness, and that require algebraic simplification to demonstrate non-stiffness; and the large-scale benchmark spans stiff and non-stiff ODE regimes. We evaluate open- and closed-source LLM models along two axes: (i) unguided versus guided prompting with domain-specific knowledge; and (ii) off-the-shelf versus fine-tuned variants. Our evaluation measures both executability and numerical validity against reference solutions. We find that with sufficient context and guided prompts, newer instruction-following models achieve high accuracy on both criteria. In many cases, recent open-source systems perform strongly without fine-tuning, while older or smaller models still benefit from fine-tuning. Overall, our preliminary results indicate that careful prompting and fine-tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs can serve as SciML agents by generating executable, domain-appropriate code for ODE initial-value problems rather than directly learning solution functions. It introduces a diagnostic benchmark of adversarial 'misleading' problems that appear stiff but are not, plus a 1000-task benchmark spanning stiff and non-stiff regimes. Evaluation across open- and closed-source models shows that guided prompting with domain knowledge yields high executability and numerical validity (measured against reference solutions), with newer instruction-tuned models performing well even without fine-tuning.

Significance. If the central empirical claims hold after clarification of reference-solution generation, the work offers a practical alternative to neural-ODE-style methods by delegating numerical integration to established libraries while using LLMs only for solver selection and code synthesis. The new benchmarks themselves constitute a reusable resource for assessing LLM-based scientific computing agents.

major comments (3)

[Evaluation / Benchmark description] Evaluation section (and abstract): numerical validity is reported against 'reference solutions' with no description of the integrator, absolute/relative tolerances, time-stepping scheme, or any post-processing/filtering used to produce those references. Because the headline accuracy numbers rest on this comparison, the absence of these details makes it impossible to assess whether the metric reflects robust scientific appropriateness or sensitivity to reference-generation choices.
[Benchmark construction] Benchmark construction: the 1000-task set is described as spanning stiff and non-stiff regimes, yet the paper provides no quantitative criteria (e.g., eigenvalue-based stiffness ratio thresholds or Jacobian condition-number cutoffs) used to label tasks. Without these, it is unclear whether the reported performance difference between guided and unguided prompts genuinely tracks stiffness handling or simply reflects easier problems.
[Results] Results presentation: accuracy figures are given without error bars, confidence intervals, or per-category breakdowns (stiff vs. non-stiff, diagnostic vs. main benchmark). This weakens the claim that 'newer instruction-following models achieve high accuracy' because the magnitude and statistical reliability of the improvement cannot be evaluated.

minor comments (2)

[Abstract / Conclusion] The abstract states the work is 'preliminary'; this qualifier should be retained or expanded in the conclusion to set appropriate expectations for the benchmark sizes and model coverage.
[Methods] Notation for the two evaluation axes (executability and numerical validity) is introduced informally; a short table or explicit definitions early in the methods would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating where revisions have been made to improve clarity and completeness.

Referee: [Evaluation / Benchmark description] Evaluation section (and abstract): numerical validity is reported against 'reference solutions' with no description of the integrator, absolute/relative tolerances, time-stepping scheme, or any post-processing/filtering used to produce those references. Because the headline accuracy numbers rest on this comparison, the absence of these details makes it impossible to assess whether the metric reflects robust scientific appropriateness or sensitivity to reference-generation choices.

Authors: We agree that the reference-solution generation process requires explicit documentation for reproducibility and to allow readers to evaluate the numerical validity metric. In the revised manuscript we have added a dedicated paragraph in the Evaluation section specifying that reference solutions were computed with scipy.integrate.solve_ivp using the BDF method for stiff problems and RK45 for non-stiff problems, with rtol=1e-8 and atol=1e-8. No post-processing or filtering steps were applied beyond the solver defaults. These choices are now stated in both the main text and the associated code repository.
Revision: yes

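The settings quoted in this simulated response can be sketched as follows. The method names and the rtol/atol values come from the response itself; the helper names `reference_solution` and `max_abs_error`, the stiff test problem, and the candidate solver settings are illustrative assumptions, and the paper's actual evaluation script is not reproduced here.

```python
import numpy as np
from scipy.integrate import solve_ivp


def reference_solution(f, t_span, y0, t_eval, stiff):
    """Tight-tolerance reference: BDF if stiff, RK45 otherwise."""
    method = "BDF" if stiff else "RK45"
    return solve_ivp(f, t_span, y0, t_eval=t_eval, method=method,
                     rtol=1e-8, atol=1e-8)


def max_abs_error(candidate, reference):
    """Largest pointwise deviation between two solve_ivp results."""
    return float(np.max(np.abs(candidate.y - reference.y)))


def f(t, y):
    # Assumed stiff test problem: fast relaxation toward cos(t).
    return -1000.0 * (y - np.cos(t))


t_eval = np.linspace(0.0, 2.0, 50)
ref = reference_solution(f, (0.0, 2.0), [0.0], t_eval, stiff=True)

# Stand-in for LLM-generated code: an implicit solver at looser tolerances.
cand = solve_ivp(f, (0.0, 2.0), [0.0], t_eval=t_eval, method="Radau",
                 rtol=1e-6, atol=1e-9)
err = max_abs_error(cand, ref)
print(f"max abs deviation from reference: {err:.2e}")
```

Because both runs share the same `t_eval` grid, the comparison is pointwise and needs no interpolation.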
Referee: [Benchmark construction] Benchmark construction: the 1000-task set is described as spanning stiff and non-stiff regimes, yet the paper provides no quantitative criteria (e.g., eigenvalue-based stiffness ratio thresholds or Jacobian condition-number cutoffs) used to label tasks. Without these, it is unclear whether the reported performance difference between guided and unguided prompts genuinely tracks stiffness handling or simply reflects easier problems.

Authors: The referee is correct that quantitative labeling criteria were not provided in the original submission. The 1000-task benchmark was assembled from standard ODE test suites whose stiffness properties are documented in the literature; however, we did not state the explicit decision rule. In the revision we now describe the criterion used: a problem is labeled stiff when the stiffness ratio (largest to smallest absolute eigenvalue of the Jacobian evaluated at the initial condition) exceeds 10^3 or when preliminary integration tests show that explicit methods become unstable. This definition is added to the Benchmark Construction subsection together with the corresponding code used to compute the ratios.
Revision: yes

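The eigenvalue part of this criterion can be sketched as below. This is a minimal illustration restricted to two-dimensional systems so that it needs only the standard library; the function names, the forward-difference Jacobian, and both test problems are assumptions rather than the authors' code, and the second branch of the rule (instability of explicit methods in preliminary integration) is not covered.

```python
import cmath


def jacobian_2d(f, t, y, rel_step=1e-6):
    """Forward-difference Jacobian of a 2-D right-hand side f(t, y)."""
    f0 = f(t, y)
    jac = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        h = rel_step * max(1.0, abs(y[j]))
        y_shift = list(y)
        y_shift[j] += h
        f1 = f(t, y_shift)
        for i in range(2):
            jac[i][j] = (f1[i] - f0[i]) / h
    return jac


def stiffness_ratio_2d(jac):
    """Ratio of largest to smallest |eigenvalue| of a 2x2 matrix."""
    trace = jac[0][0] + jac[1][1]
    det = jac[0][0] * jac[1][1] - jac[0][1] * jac[1][0]
    root = cmath.sqrt(trace * trace - 4.0 * det)
    mags = sorted(abs(lam) for lam in ((trace + root) / 2, (trace - root) / 2))
    return float("inf") if mags[0] == 0.0 else mags[1] / mags[0]


def van_der_pol(t, y, mu=1000.0):  # illustrative problem, not from the paper
    return [y[1], mu * (1.0 - y[0] ** 2) * y[1] - y[0]]


def oscillator(t, y):  # illustrative non-stiff problem
    return [y[1], -y[0]]


THRESHOLD = 1e3  # the cutoff quoted in the response above
for name, rhs, y0 in [("van der Pol", van_der_pol, [2.0, 0.0]),
                      ("oscillator", oscillator, [1.0, 0.0])]:
    ratio = stiffness_ratio_2d(jacobian_2d(rhs, 0.0, y0))
    print(name, ratio, "stiff" if ratio > THRESHOLD else "non-stiff")
```

Evaluating the Jacobian only at the initial condition is cheap but local: a problem whose stiffness develops later in the interval would be missed by this branch alone.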
Referee: [Results] Results presentation: accuracy figures are given without error bars, confidence intervals, or per-category breakdowns (stiff vs. non-stiff, diagnostic vs. main benchmark). This weakens the claim that 'newer instruction-following models achieve high accuracy' because the magnitude and statistical reliability of the improvement cannot be evaluated.

Authors: We accept that the results section would be strengthened by additional statistical detail and category breakdowns. In the revised manuscript we have added tables that report accuracy separately for stiff versus non-stiff problems and for the diagnostic set versus the main 1000-task benchmark. Because the primary evaluation uses deterministic prompts on a fixed task set, exact percentages are reported; we have nevertheless included a supplementary analysis with temperature sampling (T=0.7) over five independent runs and report the resulting standard deviations as error bars for the main models. These additions allow readers to assess both the magnitude and reliability of the observed improvements.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical evaluation on newly introduced benchmarks

The paper introduces two new datasets (a diagnostic adversarial set and a 1000-task ODE benchmark) and reports empirical results on LLM-generated code for executability and numerical validity against reference solutions. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on experimental measurements rather than any chain that reduces to its own inputs by construction. This is a standard self-contained empirical study; the evaluation does not invoke uniqueness theorems, ansatzes smuggled via citation, or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions about LLM code generation and numerical ODE solving; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Reference solutions exist and can be used to judge numerical validity of generated code
Invoked when measuring accuracy against reference solutions on the benchmarks.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems
cs.MA 2026-05 unverdicted novelty 7.0

An ensemble-based information-theoretic active learning method with ensemble Kalman inversion selects valuable tasks to optimize communication structures in LLM multi-agent systems under constrained budgets.
Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems
cs.MA 2026-05 unverdicted novelty 6.0

An ensemble-based information-theoretic active learning method using ensemble Kalman inversion selects valuable tasks to optimize communication structures in LLM multi-agent systems more reliably than random sampling ...

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1] Maziar Raissi, Paris Perdikaris, and George E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.

[2] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.

[3] Amir Gholami, Kurt Keutzer, and George Biros. Anode: Unconditionally accurate memory-efficient gradients for neural odes. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pages 730–736, 2019.

[4] Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George E. Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.

[5] Kaushik Bhattacharya, Bamdad Hosseini, Nikola B Kovachki, and Andrew M Stuart. Model reduction and neural networks for parametric pdes. The SMAI journal of computational mathematics, 7:121–157, 2021.

[6] Nikola B. Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew M. Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces with applications to pdes. Journal of Machine Learning Research, 24(89):1–97, 2023.

[7] Nicholas H Nelsen and Andrew M Stuart. The random feature model for input-output maps between banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212–A3243, 2021.

[8] Anima Anandkumar, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Nikola Kovachki, Zongyi Li, Burigede Liu, and Andrew Stuart. Neural operator: Graph kernel network for partial differential equations. In ICLR 2020 workshop on integration of deep neural models and differential equations, 2020.

[9] Ravi G Patel, Nathaniel A Trask, Mitchell A Wood, and Eric C Cyr. A physics-informed operator regression framework for extracting data-driven continuum models. Computer Methods in Applied Mechanics and Engineering, 373:113500, 2021.

[10] Jonas Adler and Ozan Öktem. Solving ill-posed inverse problems using iterative deep neural networks. Inverse Problems, 33(12):124007, 2017.
[11] Saakaar Bhatnagar, Yaser Afshar, Shaowu Pan, Karthik Duraisamy, and Shailendra Kaushik. Prediction of aerodynamic flow fields using convolutional neural networks. Computational Mechanics, 64(2):525–545, 2019.

[12] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric pde problems with artificial neural networks. European Journal of Applied Mathematics, 32(3):421–435, 2021.

[13] Bing Yu et al. The deep ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.

[14] Leah Bar and Nir Sochen. Unsupervised deep learning algorithm for pde-based forward and inverse problems. arXiv preprint arXiv:1904.05417, 2019.

[15] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations (ICLR), 2021.

[16] Yuwei Fan, Cindy Orozco Bohorquez, and Lexing Ying. Bcr-net: A neural network based on the nonstandard wavelet form. Journal of Computational Physics, 384:1–15, 2019.

[17] Jose Antonio Lara Benitez, Junyi Guo, Kareem Hegazy, Ivan Dokmanic, Michael W. Mahoney, and Maarten V. de Hoop. Neural equilibria for long-term prediction of nonlinear conservation laws. Technical report, arXiv preprint arXiv:2501.06933, 2025.

[18] Aditi S. Krishnapriyan, Amir Gholami, Shandian Zhe, Robert M. Kirby, and Michael W. Mahoney. Characterizing possible failure modes in physics-informed neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, 2021.

[19] E. Hairer, S. P. Nørsett, and G. Wanner. Solving ordinary differential equations I (2nd revised. ed.): nonstiff problems. Springer-Verlag, Berlin, Heidelberg, 1993.

[20] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps. In NeurIPS 2021 Track on Datasets and Benchmarks, 2021.

[21] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Łukasz Kaiser, Mohammad Bavarian...
[22] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

[23] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th Internatio...

[24] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In International Conference on Learning Representations (ICLR), 2024. Oral.

[25] Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, and Robby T. Tan. Codejudgebench: Benchmarking llm-as-a-judge for coding tasks. arXiv preprint arXiv:2507.10535, 2025.

[26] Jiayi Sheng, Luna Lyu, Jikai Jin, Tony Xia, Alex Gu, James Zou, and Pan Lu. Solving inequality proofs with large language models. arXiv preprint arXiv:2506.07927, 2025.

[27] Zenan Li, Zhaoyu Li, Wen Tang, Xian Zhang, Yuan Yao, Xujie Si, Fan Yang, Kaiyu Yang, and Xiaoxing Ma. Proving olympiad inequalities by synergizing llms and symbolic reasoning. In International Conference on Learning Representations (ICLR), 2025.

[28] Yangqiaoyu Zhou, Haokun Liu, Tejes Srivastava, Hongyuan Mei, and Chenhao Tan. Hypothesis generation with large language models. arXiv preprint arXiv:2404.04326, 2024.

[29] Xuhan Huang, Qingning Shen, Yan Hu, Anningzhe Gao, and Benyou Wang. Mamo: A mathematical modeling benchmark with solvers. arXiv preprint arXiv:2405.13144, 2024.

[30] Michael Shalyt, Rotem Elimelech, and Ido Kaminer. Asymob: Algebraic symbolic mathematical operations benchmark. arXiv preprint arXiv:2505.23851, 2025.

[31] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research, pages 10764–10799. PMLR, 2023.

[32] Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3806–3824, Singapore, December 2023. Association for Computational Linguistics.

[33] Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, and Ameet Talwalkar. Codepde: An inference framework for llm-driven pde solver generation. arXiv preprint arXiv:2505.08783, 2025.

[34] Mauricio Soroco, Jialin Song, Mengzhou Xia, Kye Emond, Weiran Sun, and Wuyang Chen. Pde-controller: LLMs for autoformalization and reasoning of PDEs. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, Vancouver, Canada,

[35] Qile Jiang, Zhiwei Gao, and George Em Karniadakis. Deepseek vs. chatgpt vs. claude: A comparative study for scientific computing and scientific machine learning tasks. Theoretical and Applied Mechanics Letters, 15(3):100583, 2025.

A Prompt Engineering for ODE-1000 Dataset
In this section, we present prompts we adopt to generate and evaluate the ODE-1000...
[36]

description

t_eval = np.linspace(0, 40, 250) sol = solve_ivp(f, [0, 40], [50], t_eval=t_eval, method=’RK45’) Now, here’s the problem you need to solve: example["description"] In the final output, only return the correct code (function and call to solve_ivp), nothing else. Use "sol" to refer to the solution. Make sure that the code is syntactically correct, and do not...

work page
[37]

A clear description of the ODE in terms of variables of your choice (you should always give away t_eval and number of evaluation points)

work page
[38]

The correct corresponding sympy equation

work page
[39]

The correct initial condition (in valid sympy format)

work page
[40]

Reasoning to set up/solve the ODE with the correct solve_ivp parameters

work page
[41]

description

The correct Python code using solve_ivp with appropriate parameters The sympy equation and initial conditions should be in terms of y(t) even if the original equation is given in terms of different variables. The examples should be diverse in terms of: - Different types of ODEs (different orders, stiff, non-stiff, etc.) - Different correct usages of numer...

work page
[42]

First, provide your detailed reasoning about why this ODE is stiff or non-stiff

work page
[43]

explicit solver

Then, provide your final answer as either "explicit solver" or "implicit solver". Don’t mention anything else, just the solver choice. Please format your response in json format as follows. IT MUST BE V ALID JSON: { "reason": "Your detailed analysis here", "answer": "explicit solver/implicit solver" } 15 Guiding System Prompt You are an expert in numerica...

work page
[44]

Given that the ODE is in the form of dy/dt = p(t)y + q(t), please first try to simplify the p(t) and q(t) to make it easier to solve

work page
[45]

Then, provide your detailed reasoning about why this ODE is stiff or non-stiff

work page
[46]

explicit solver

Finally, provide your final answer as either "explicit solver" or "implicit solver". Don’t mention anything else, just the solver choice. Please format your response in json format as follows. IT MUST BE V ALID JSON: { "reason": "Your detailed analysis here", "answer": "explicit solver/implicit solver" } 16 D Model Evaluation on Diagnostic Dataset In this...

work page
[47]

Such a large coefficient creates rapid changes in y with respect to t and leads to the presence of both fast and slow dynamics

Failed Case •ODE example: dy/dt = (50000*(cos(2*t) - 2*cos(t)**2 + 1))*y + sin(2*t), •Traces of Failed Reasoning : This ODE has the derivative term dy/dt involving a large coefficient (50000) multiplied by y and trigonometric functions of t. Such a large coefficient creates rapid changes in y with respect to t and leads to the presence of both fast and sl...

work page
[48]

This reduces to dy/dt = cos(t), a very simple, non-stiff ODE with no dependence on the large coefficient (since it cancels out)

Successful Case •ODE example: dy/dt = (20000*(sin(t)**2 + cos(t)**2) - 20000)*y + cos(t), •Traces of Successful Reasoning : The ODE simplifies (sin(t)2 +cos(t) 2 = 1), so dy/dt = (20000*1 - 20000)*y + cos(t) = cos(t). This reduces to dy/dt = cos(t), a very simple, non-stiff ODE with no dependence on the large coefficient (since it cancels out). There are ...

work page
[49]

This is characteristic of stiff ODEs, as explicit methods may require prohibitively small step sizes for stability over even a short interval

Failed Case •ODE example: dy/dt = (4e7 * (sin(arcsin(t/2)) - t/2))*y + exp(t), •Domain: t∈[0, 1], •Traces of Failed Reasoning : The ODE’s coefficient of y contains a large con- stant factor (4e7) that multiplies a function of t, which can lead to very rapid changes in the solution (potentially large negative or positive eigenvalues in the Jacobian). This ...

work page
[50]

The Jacobian with respect to y is 0, so there are no stiff eigenvalues and the system is not stiff

Successful Case •ODE example: dy/dt = (1e7 * log(exp(t + 1)) - 1e7 * (t + 1))*y + cos(t), •Domain: t∈[0, 1], •Traces of Successful Reasoning : The log(exp(t + 1)) term equals t + 1, so 1e7 * log(exp(t + 1)) - 1e7 * (t + 1) cancels to 0. The Jacobian with respect to y is 0, so there are no stiff eigenvalues and the system is not stiff. Only the cos(t) term...

work page
[51]

The right side shows linear dependence on y with a time-varying coefficient P(t) = (t−5) 4 −t 4 + 20t3 −150t 2 + 500t−625

Failed Case 19 •ODE example : dy/dt = ((t-5)**4 - t**4 + 20*t**3 - 150*t**2 + 500*t - 625)*y + sqrt(t+3), •Traces of Failed Reasoning : To determine stiffness, let’s analyze the ODE dy/dt= ((t−5) 4 −t 4 + 20∗t 3 −150∗t 2 + 500∗t−625)∗y+sqrt(t+ 3) . The right side shows linear dependence on y with a time-varying coefficient P(t) = (t−5) 4 −t 4 + 20t3 −150t...

work page
[52]

The coefficient of y simplifies as follows: (t+ 5) 3 −t 3 −15t 2 −75t−125 = (t3 + 15t2 + 75t+ 125)−t 3 −15t 2 −75t−125 = 0

Successful Case •ODE example: dy/dt = ((t+5)**3 - t**3 - 15*t**2 - 75*t - 125)*y + sqrt(t+2), •Traces of Successful Reasoning : To assess stiffness, examine the ODE’s right-hand side: dy/dt= ((t+ 5) 3 −t 3 −15t 2 −75t−125)∗y+sqrt(t+ 2) . The coefficient of y simplifies as follows: (t+ 5) 3 −t 3 −15t 2 −75t−125 = (t3 + 15t2 + 75t+ 125)−t 3 −15t 2 −75t−125 ...


[1] Maziar Raissi, Paris Perdikaris, and George E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.

[2] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.

[3] Amir Gholami, Kurt Keutzer, and George Biros. Anode: Unconditionally accurate memory-efficient gradients for neural odes. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pages 730–736, 2019.

[4] Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George E. Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.

[5] Kaushik Bhattacharya, Bamdad Hosseini, Nikola B Kovachki, and Andrew M Stuart. Model reduction and neural networks for parametric pdes. The SMAI journal of computational mathematics, 7:121–157, 2021.

[6] Nikola B. Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew M. Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces with applications to pdes. Journal of Machine Learning Research, 24(89):1–97, 2023.

[7] Nicholas H Nelsen and Andrew M Stuart. The random feature model for input-output maps between banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212–A3243, 2021.

[8] Anima Anandkumar, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Nikola Kovachki, Zongyi Li, Burigede Liu, and Andrew Stuart. Neural operator: Graph kernel network for partial differential equations. In ICLR 2020 workshop on integration of deep neural models and differential equations, 2020.

[9] Ravi G Patel, Nathaniel A Trask, Mitchell A Wood, and Eric C Cyr. A physics-informed operator regression framework for extracting data-driven continuum models. Computer Methods in Applied Mechanics and Engineering, 373:113500, 2021.

[10] Jonas Adler and Ozan Öktem. Solving ill-posed inverse problems using iterative deep neural networks. Inverse Problems, 33(12):124007, 2017.

[11] Saakaar Bhatnagar, Yaser Afshar, Shaowu Pan, Karthik Duraisamy, and Shailendra Kaushik. Prediction of aerodynamic flow fields using convolutional neural networks. Computational Mechanics, 64(2):525–545, 2019.

[12] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric pde problems with artificial neural networks. European Journal of Applied Mathematics, 32(3):421–435, 2021.

[13] Bing Yu et al. The deep ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.

[14] Leah Bar and Nir Sochen. Unsupervised deep learning algorithm for pde-based forward and inverse problems. arXiv preprint arXiv:1904.05417, 2019.

[15] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations (ICLR), 2021.

[16] Yuwei Fan, Cindy Orozco Bohorquez, and Lexing Ying. Bcr-net: A neural network based on the nonstandard wavelet form. Journal of Computational Physics, 384:1–15, 2019.

[17] Jose Antonio Lara Benitez, Junyi Guo, Kareem Hegazy, Ivan Dokmanic, Michael W. Mahoney, and Maarten V. de Hoop. Neural equilibria for long-term prediction of nonlinear conservation laws. Technical report, arXiv preprint arXiv:2501.06933, 2025.

[18] Aditi S. Krishnapriyan, Amir Gholami, Shandian Zhe, Robert M. Kirby, and Michael W. Mahoney. Characterizing possible failure modes in physics-informed neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, 2021.

[19] E. Hairer, S. P. Nørsett, and G. Wanner. Solving ordinary differential equations I (2nd revised. ed.): nonstiff problems. Springer-Verlag, Berlin, Heidelberg, 1993.

[20] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps. In NeurIPS 2021 Track on Datasets and Benchmarks, 2021.

[21] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Łukasz Kaiser, Mohammad Bavarian...

[22] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

[23] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th Internatio...

[24] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In International Conference on Learning Representations (ICLR), 2024. Oral.

[25] Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, and Robby T. Tan. Codejudgebench: Benchmarking llm-as-a-judge for coding tasks. arXiv preprint arXiv:2507.10535, 2025.

[26] Jiayi Sheng, Luna Lyu, Jikai Jin, Tony Xia, Alex Gu, James Zou, and Pan Lu. Solving inequality proofs with large language models. arXiv preprint arXiv:2506.07927, 2025.

[27] Zenan Li, Zhaoyu Li, Wen Tang, Xian Zhang, Yuan Yao, Xujie Si, Fan Yang, Kaiyu Yang, and Xiaoxing Ma. Proving olympiad inequalities by synergizing llms and symbolic reasoning. In International Conference on Learning Representations (ICLR), 2025.

[28] Yangqiaoyu Zhou, Haokun Liu, Tejes Srivastava, Hongyuan Mei, and Chenhao Tan. Hypothesis generation with large language models. arXiv preprint arXiv:2404.04326, 2024.

[29] Xuhan Huang, Qingning Shen, Yan Hu, Anningzhe Gao, and Benyou Wang. Mamo: A mathematical modeling benchmark with solvers. arXiv preprint arXiv:2405.13144, 2024.

[30] Michael Shalyt, Rotem Elimelech, and Ido Kaminer. Asymob: Algebraic symbolic mathematical operations benchmark. arXiv preprint arXiv:2505.23851, 2025.

[31] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research, pages 10764–10799. PMLR, 2023.

[32] Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3806–3824, Singapore, December 2023. Association for Computational Linguistics.

[33] Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, and Ameet Talwalkar. Codepde: An inference framework for llm-driven pde solver generation. arXiv preprint arXiv:2505.08783, 2025.

[34] Mauricio Soroco, Jialin Song, Mengzhou Xia, Kye Emond, Weiran Sun, and Wuyang Chen. Pde-controller: LLMs for autoformalization and reasoning of PDEs. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, Vancouver, Canada,

[35] Qile Jiang, Zhiwei Gao, and George Em Karniadakis. Deepseek vs. chatgpt vs. claude: A comparative study for scientific computing and scientific machine learning tasks. Theoretical and Applied Mechanics Letters, 15(3):100583, 2025.

A Prompt Engineering for ODE-1000 Dataset

In this section, we present prompts we adopt to generate and evaluate the ODE-1000...

[36]
t_eval = np.linspace(0, 40, 250)
sol = solve_ivp(f, [0, 40], [50], t_eval=t_eval, method='RK45')

Now, here’s the problem you need to solve: example["description"]

In the final output, only return the correct code (function and call to solve_ivp), nothing else. Use "sol" to refer to the solution. Make sure that the code is syntactically correct, and do not...
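The prompt fragment above shows only the `solve_ivp` call, with the function `f` defined in the elided part. A minimal runnable sketch of the requested output shape follows; the right-hand side used here (exponential decay with rate 0.1) is a hypothetical stand-in and is not taken from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical right-hand side (not from the source): dy/dt = -0.1 * y.
def f(t, y):
    return -0.1 * y

t_eval = np.linspace(0, 40, 250)
sol = solve_ivp(f, [0, 40], [50], t_eval=t_eval, method="RK45")

# Basic sanity checks an evaluation harness might run on `sol`.
print(sol.success, sol.y.shape, sol.y[0, -1])
```

Keeping the result in a variable named `sol`, as the prompt requires, lets a harness compare `sol.y` against a reference solution on the same `t_eval` grid.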

[37] A clear description of the ODE in terms of variables of your choice (you should always give away t_eval and number of evaluation points)

[38] The correct corresponding sympy equation

[39] The correct initial condition (in valid sympy format)

[40] Reasoning to set up/solve the ODE with the correct solve_ivp parameters

[41] The correct Python code using solve_ivp with appropriate parameters

The sympy equation and initial conditions should be in terms of y(t) even if the original equation is given in terms of different variables. The examples should be diverse in terms of:
- Different types of ODEs (different orders, stiff, non-stiff, etc.)
- Different correct usages of numer...

[42] First, provide your detailed reasoning about why this ODE is stiff or non-stiff

[43] Then, provide your final answer as either "explicit solver" or "implicit solver". Don’t mention anything else, just the solver choice.

Please format your response in json format as follows. IT MUST BE VALID JSON:
{ "reason": "Your detailed analysis here", "answer": "explicit solver/implicit solver" }

Guiding System Prompt

You are an expert in numerica...

[44] Given that the ODE is in the form of dy/dt = p(t)y + q(t), please first try to simplify the p(t) and q(t) to make it easier to solve

[45] Then, provide your detailed reasoning about why this ODE is stiff or non-stiff

[46] Finally, provide your final answer as either "explicit solver" or "implicit solver". Don’t mention anything else, just the solver choice.

Please format your response in json format as follows. IT MUST BE VALID JSON:
{ "reason": "Your detailed analysis here", "answer": "explicit solver/implicit solver" }

D Model Evaluation on Diagnostic Dataset

In this...
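Both prompts above ask for a JSON object with a "reason" and an "answer" field. A minimal sketch of how such a reply could be validated before scoring is given below; the function name `parse_solver_response` and the error handling are assumptions for illustration, not the paper's evaluation code.

```python
import json

ALLOWED_ANSWERS = {"explicit solver", "implicit solver"}

def parse_solver_response(raw):
    """Parse a model reply and return (reason, answer); raise ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    reason = data.get("reason")
    answer = data.get("answer")
    if not isinstance(reason, str) or not isinstance(answer, str):
        raise ValueError('"reason" and "answer" must both be strings')
    # Normalize case and whitespace before checking the allowed labels.
    answer = answer.strip().lower()
    if answer not in ALLOWED_ANSWERS:
        raise ValueError(f"unexpected answer: {answer!r}")
    return reason, answer

reply = '{"reason": "The coefficient of y cancels to 0.", "answer": "explicit solver"}'
reason, answer = parse_solver_response(reply)
print(answer)
```

Rejecting anything outside the two allowed labels keeps a reply that copies the template string "explicit solver/implicit solver" from being counted as a valid choice.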

[47] Failed Case
• ODE example: dy/dt = (50000*(cos(2*t) - 2*cos(t)**2 + 1))*y + sin(2*t)
• Traces of Failed Reasoning: This ODE has the derivative term dy/dt involving a large coefficient (50000) multiplied by y and trigonometric functions of t. Such a large coefficient creates rapid changes in y with respect to t and leads to the presence of both fast and slow dynamics...
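The failed trace treats the factor 50000 as decisive without reducing the trigonometric expression it multiplies. A short SymPy sketch of that missing step, using the double-angle formula cos(2t) = 2cos(t)^2 - 1, is shown below; the variable names are illustrative.

```python
import sympy as sp

t = sp.Symbol("t", real=True)

# Coefficient of y in the failed case above.
coefficient = 50000 * (sp.cos(2 * t) - 2 * sp.cos(t) ** 2 + 1)

# Rewrite cos(2t) via the double-angle formula, then collect like terms.
reduced = sp.expand(sp.expand_trig(coefficient))
print(reduced)
```

With the coefficient reduced, the equation is dy/dt = sin(2t), which has no dependence on y and therefore no stiffness.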

[48] Successful Case
• ODE example: dy/dt = (20000*(sin(t)**2 + cos(t)**2) - 20000)*y + cos(t)
• Traces of Successful Reasoning: The ODE simplifies (sin(t)^2 + cos(t)^2 = 1), so dy/dt = (20000*1 - 20000)*y + cos(t) = cos(t). This reduces to dy/dt = cos(t), a very simple, non-stiff ODE with no dependence on the large coefficient (since it cancels out). There are ...
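The simplify-first strategy in this trace matches the guided prompt's form dy/dt = p(t)y + q(t). A minimal sketch of separating and simplifying p(t) and q(t) with SymPy follows; the helper `split_linear_ode` is a hypothetical name, and it assumes the right-hand side is linear in y.

```python
import sympy as sp

t = sp.Symbol("t", real=True)
y = sp.Symbol("y", real=True)

def split_linear_ode(rhs):
    """Return simplified (p, q) such that rhs == p(t)*y + q(t) for an RHS linear in y."""
    expanded = sp.expand(rhs)
    p = sp.simplify(expanded.coeff(y, 1))  # terms multiplying y
    q = sp.simplify(expanded.coeff(y, 0))  # terms independent of y
    return p, q

rhs = (20000 * (sp.sin(t) ** 2 + sp.cos(t) ** 2) - 20000) * y + sp.cos(t)
p, q = split_linear_ode(rhs)
print(p, q)
```

For a scalar linear ODE, p(t) is the Jacobian, so inspecting the simplified p over the domain is what the stiffness judgment should rest on, not the size of constants in the unsimplified expression.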
