
arxiv: 2603.21011 · v2 · submitted 2026-01-08 · 💻 cs.CE · cs.AI · cs.LG · cs.MS · cs.NA · math.NA

ALL-FEM: Agentic Large Language models Fine-tuned for Finite Element Methods

Pith reviewed 2026-05-16 15:45 UTC · model grok-4.3

classification 💻 cs.CE · cs.AI · cs.LG · cs.MS · cs.NA · math.NA
keywords finite element methods · FEniCS · large language models · agentic AI · code generation · computational engineering · multi-agent systems · PDE simulation

The pith

Fine-tuned LLMs inside a multi-agent loop with runtime feedback generate correct FEniCS code for 71.79 percent of tested finite-element problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that fine-tuning LLMs on verified FEniCS scripts, and placing them inside an agentic workflow that handles problem formulation, code generation, debugging, and result visualization, produces working simulation code for most of the tested solid mechanics, fluid flow, and multiphysics problems. Conventional LLMs often hallucinate or ignore variational structure; the added agents and runtime feedback loop narrow that gap enough for the fine-tuned 120B-parameter model to exceed the success rate of GPT 5 Thinking deployed without agents. A training corpus of more than one thousand expert-written and LLM-generated scripts supplies examples across linear and nonlinear elasticity, Newtonian and non-Newtonian fluids, fluid-structure interaction, and moving domains.

Core claim

An autonomous system called ALL-FEM orchestrates specialized agents powered by LLMs fine-tuned on a corpus of 1000+ verified FEniCS scripts; when the best such model (GPT OSS 120B) operates inside the multi-agent workflow with runtime feedback, it reaches 71.79 percent code-level success on 39 benchmarks that cover linear/nonlinear elasticity, plasticity, Newtonian/non-Newtonian flow, thermofluids, fluid-structure interaction, phase separation, and transport on moving domains, surpassing a non-agentic deployment of GPT 5 Thinking.
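The headline number can be cross-checked against the benchmark count. Assuming the rate is a plain fraction of the 39 benchmarks (rather than an average over repeated runs, which this page does not specify), only one pass count reproduces 71.79 percent:

```python
# Assumption: the rate is (number of passing benchmarks) / 39, with no averaging
# over repeated runs; the page does not say which.
N_BENCHMARKS = 39
REPORTED_RATE = 71.79  # percent, as given in the abstract


def implied_passes(rate_percent: float, n: int) -> list[int]:
    """Return every pass count k in 0..n whose percentage rounds to rate_percent."""
    return [k for k in range(n + 1) if round(100 * k / n, 2) == rate_percent]


print(implied_passes(REPORTED_RATE, N_BENCHMARKS))  # [28]
```

Under that assumption the model passes 28 benchmarks and fails 11, so a single benchmark flipping either way moves the rate by about 2.6 percentage points.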

What carries the argument

The multi-agent workflow with runtime feedback that uses fine-tuned LLMs to translate problem statements into PDEs, generate and debug FEniCS code, and visualize results.
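The feedback loop at the center of this claim can be sketched as a generate, execute, and debug cycle. This is a minimal illustration of the general pattern, not the paper's implementation: `generate` and `debug` are hypothetical stand-ins for calls to a fine-tuned model, and executing real FEniCS scripts would require an interpreter with FEniCS installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(code: str, timeout: float = 60.0) -> tuple[bool, str]:
    """Run a generated script in a fresh interpreter; return (succeeded, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout} s"
    return proc.returncode == 0, proc.stderr


def generate_with_feedback(generate, debug, problem, max_rounds=5):
    """Generate -> execute -> debug loop driven by runtime errors.

    `generate(problem)` and `debug(problem, code, stderr)` are hypothetical
    stand-ins for calls to a fine-tuned LLM, not ALL-FEM's actual agents.
    Returns (working_code, attempts_used), or (None, max_rounds) on failure.
    """
    code = generate(problem)
    for attempt in range(1, max_rounds + 1):
        succeeded, stderr = run_script(code)
        if succeeded:
            return code, attempt
        if attempt < max_rounds:
            code = debug(problem, code, stderr)  # feed the traceback back to the model
    return None, max_rounds
```

This loop only checks the exit code. A script can run cleanly and still solve the wrong variational problem, which is the gap raised in the referee's first major comment.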

If this is right

  • Engineers can obtain working simulation code from natural-language problem statements without writing or debugging the code themselves.
  • The same agentic pattern supplies a template for automating other computational-science workflows that require both code generation and runtime verification.
  • Smaller, fine-tuned models become competitive with much larger general models once they are embedded in domain-specific agent loops.
  • Rapid iteration over geometries, material laws, and boundary conditions becomes feasible for design exploration in manufacturing and research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could be tested on industrial-scale meshes and nonlinear solvers to measure how far the success rate drops when problem size increases.
  • Integration with geometry kernels or CAD files would let the system accept drawings rather than text descriptions as input.
  • Similar fine-tuning plus agent orchestration might apply to other open-source simulation libraries beyond FEniCS.

Load-bearing premise

The 39 benchmarks and the automated verification process capture the full range of real-world finite-element problems without missing subtle numerical or physical errors that would appear only in production use.

What would settle it

Running the generated codes on a new set of problems outside the 39 benchmarks and checking whether their numerical solutions match independent reference results or experimental data to within engineering tolerances.
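Once a generated code's output and an independent reference are sampled at common points, the proposed test reduces to a comparison like the one below. This is a sketch: the sampling, the discrete relative L2 norm, and the 5 percent default are assumptions standing in for whatever tolerance a given application demands.

```python
import math


def relative_l2_error(computed, reference):
    """Discrete relative L2 error between two solutions sampled at the same points."""
    if len(computed) != len(reference):
        raise ValueError("solutions must be sampled at the same points")
    diff = math.sqrt(sum((c - r) ** 2 for c, r in zip(computed, reference)))
    norm = math.sqrt(sum(r ** 2 for r in reference))
    if norm == 0.0:
        raise ValueError("reference solution has zero norm; use an absolute tolerance")
    return diff / norm


def within_tolerance(computed, reference, rtol=0.05):
    """True if the relative L2 error is at most rtol (5 % is a placeholder)."""
    return relative_l2_error(computed, reference) <= rtol
```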

Original abstract

Finite element (FE) analysis guides the design and verification of nearly all manufactured objects. It is at the core of computational engineering, enabling simulation of complex physical systems, from fluids and solids to multiphysics systems. However, implementing FE codes and analyzing simulation results demands expertise across numerical analysis, continuum mechanics, and programming. Conventional Large Language Models (LLMs) can generate FE code, but they hallucinate, lack awareness of variational structures, and cannot close the loop from problem statement to a verified solution. Here, we propose ALL-FEM, an autonomous simulation system that integrates agentic AI with domain-specific, fine-tuned LLMs for FEniCS code generation across solid, fluid, and multiphysics applications. We construct a corpus of 1000+ verified FEniCS scripts by combining 500+ curated expert codes with a retrieval-augmented, multi-LLM pipeline that generates and filters codes for diverse PDEs, geometries, and boundary conditions. We used the corpus to fine-tune LLMs with 3B to 120B parameters. Our agentic framework orchestrates specialized agents, powered by fine-tuned LLMs, to formulate problems as PDEs, generate and debug code and visualize the results. We evaluated the system on 39 benchmarks that include problems of linear/nonlinear elasticity, plasticity, Newtonian/non-Newtonian flow, thermofluids, fluid-structure interaction, phase separation, and transport on moving domains. Embedded in a multi-agent workflow with runtime feedback, the best fine-tuned model (GPT OSS 120B) achieves code-level success of 71.79%, outperforming a non-agentic deployment of GPT 5 Thinking. By showing that relatively small, fine-tuned LLMs, orchestrated through agentic frameworks, can automate FE workflows, ALL-FEM offers a blueprint for autonomous simulation systems in computational science and engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ALL-FEM, an autonomous agentic system that integrates fine-tuned LLMs (3B–120B parameters) with specialized agents for FEniCS code generation in finite element analysis. A corpus of 1000+ verified scripts is built from 500+ expert codes plus a retrieval-augmented multi-LLM pipeline. The system is evaluated on 39 benchmarks spanning linear/nonlinear elasticity, plasticity, Newtonian/non-Newtonian flows, thermofluids, fluid-structure interaction, phase separation, and moving-domain transport. The best model (GPT OSS 120B) embedded in a multi-agent workflow with runtime feedback achieves 71.79% code-level success, outperforming a non-agentic GPT 5 Thinking baseline.

Significance. If the verification process ensures that generated codes produce numerically accurate and physically correct solutions (rather than merely executing without runtime errors), the work would offer a practical blueprint for automating FE workflows in computational engineering. The combination of domain-specific fine-tuning and agentic orchestration with feedback demonstrates measurable gains over direct LLM prompting on variational-form and solver tasks. The scale of the corpus and the breadth of benchmark categories (solids, fluids, multiphysics) would make the result relevant to researchers seeking to reduce manual coding effort in simulation-driven design.

major comments (3)
  1. [Abstract / Evaluation] Abstract and evaluation description: the headline 71.79% code-level success rate for the 120B model is reported without any quantitative specification of the verification criteria. It is unclear whether success requires only runtime execution without errors or includes checks such as L2-norm agreement with analytical solutions, residual norms on the weak form, or observed mesh-convergence rates. This distinction is load-bearing because codes can execute while containing incorrect variational formulations, improper quadrature, or boundary-condition errors that only manifest under refinement or in coupled problems.
  2. [Corpus construction] Corpus construction: the manuscript states that 1000+ scripts were obtained by combining expert codes with a multi-LLM retrieval-augmented pipeline, yet supplies no numbers on how many candidates were generated, how many were discarded by the filtering step, per-PDE-category error rates, or the exact verification procedure used to label a script “verified.” Without these statistics the quality and diversity of the fine-tuning data cannot be assessed, undermining claims about the robustness of the resulting models.
  3. [Benchmarks] Benchmarks: the 39 problems are listed by category (elasticity, plasticity, flows, FSI, etc.) but no table or appendix enumerates the individual problems, their governing equations, mesh types, or whether analytical solutions exist for quantitative validation. Per-benchmark success rates are also absent, making it impossible to determine whether the aggregate 71.79% figure is driven by a few easy cases or holds across the full range of nonlinear and multiphysics problems.
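To make major comment 1 concrete: a quantitative check of the kind the referee describes compares a computed solution with an analytical one in the L2 norm and estimates the observed convergence rate under refinement, p ≈ log(e_h / e_(h/2)) / log 2. The sketch below does this for a 1D Poisson problem with linear elements in plain Python (no FEniCS), chosen only because its exact solution u = sin(πx) is known; linear elements should give p ≈ 2 in the L2 norm. It illustrates the criterion and is not the paper's verification procedure.

```python
import math


def solve_poisson_p1(n: int):
    """Solve -u'' = pi^2 sin(pi x) on (0, 1), u(0) = u(1) = 0, with n >= 2 linear elements.

    Returns node coordinates and nodal values. The load vector uses trapezoidal
    quadrature, so the system is K u = F with K = (1/h) tridiag(-1, 2, -1).
    """
    h = 1.0 / n
    x = [i * h for i in range(n + 1)]
    m = n - 1  # number of interior unknowns
    sub = [-1.0 / h] * m
    diag = [2.0 / h] * m
    sup = [-1.0 / h] * m
    rhs = [h * math.pi**2 * math.sin(math.pi * x[i + 1]) for i in range(m)]
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, m):
        w = sub[i] / diag[i - 1]
        diag[i] -= w * sup[i - 1]
        rhs[i] -= w * rhs[i - 1]
    u = [0.0] * m
    u[-1] = rhs[-1] / diag[-1]
    for i in range(m - 2, -1, -1):
        u[i] = (rhs[i] - sup[i] * u[i + 1]) / diag[i]
    return x, [0.0] + u + [0.0]


def l2_error(x, u_h):
    """L2 norm of sin(pi x) - u_h, using 3-point Gauss quadrature on each element."""
    gauss = [(-math.sqrt(0.6), 5 / 9), (0.0, 8 / 9), (math.sqrt(0.6), 5 / 9)]
    total = 0.0
    for k in range(len(x) - 1):
        h = x[k + 1] - x[k]
        for xi, w in gauss:
            t = 0.5 * (xi + 1.0)  # Gauss point mapped from [-1, 1] to [0, 1]
            s = x[k] + t * h
            uh = (1.0 - t) * u_h[k] + t * u_h[k + 1]  # linear interpolation
            total += w * 0.5 * h * (math.sin(math.pi * s) - uh) ** 2
    return math.sqrt(total)


def observed_rate(e_coarse: float, e_fine: float, ratio: float = 2.0) -> float:
    """Observed order p from errors on meshes whose sizes differ by `ratio`."""
    return math.log(e_coarse / e_fine) / math.log(ratio)


errors = {n: l2_error(*solve_poisson_p1(n)) for n in (8, 16, 32)}
print(observed_rate(errors[16], errors[32]))  # close to 2 for linear elements
```

A script with a wrong sign or a misplaced boundary condition would still run here without error; only the error norm and the rate would expose it.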
minor comments (2)
  1. Clarify the exact model names (GPT OSS 120B, GPT 5 Thinking) and provide references or parameter counts for the baseline models in the main text.
  2. Consider adding a summary table of the 39 benchmarks that includes problem type, domain, boundary conditions, and whether an analytical solution is available for verification.
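The table suggested in minor comment 2, together with the per-benchmark rates requested in major comment 3, amounts to grouping per-benchmark outcomes. A sketch with hypothetical field names; because the page reports no per-benchmark outcomes, no data are shown:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    name: str
    category: str                  # e.g. "linear elasticity", "fluid-structure interaction"
    has_analytical_solution: bool
    passed: bool


def success_by(results, key):
    """Group results by key(result); return {group: (passed, total, percent)}."""
    groups = defaultdict(lambda: [0, 0])
    for r in results:
        g = groups[key(r)]
        g[0] += int(r.passed)
        g[1] += 1
    return {k: (p, t, round(100 * p / t, 2)) for k, (p, t) in groups.items()}
```

Grouping by `category` would show whether the aggregate is carried by a few easy categories; grouping by `has_analytical_solution` separates results backed by quantitative checks from the rest.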

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These have highlighted important areas for clarification and expansion. We address each major comment point by point below, indicating the revisions that will be incorporated in the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and evaluation description: the headline 71.79% code-level success rate for the 120B model is reported without any quantitative specification of the verification criteria. It is unclear whether success requires only runtime execution without errors or includes checks such as L2-norm agreement with analytical solutions, residual norms on the weak form, or observed mesh-convergence rates. This distinction is load-bearing because codes can execute while containing incorrect variational formulations, improper quadrature, or boundary-condition errors that only manifest under refinement or in coupled problems.

    Authors: We agree that the verification criteria require explicit definition. In the current manuscript, code-level success is defined as the generated FEniCS script executing without runtime errors while producing results consistent with expected physical behavior; for the subset of benchmarks possessing analytical or high-fidelity reference solutions, this includes quantitative checks such as L2-norm agreement and residual norms on the weak form. To remove any ambiguity, we will revise the abstract and add a dedicated paragraph in the Evaluation section that enumerates the precise success criteria, including runtime execution, residual and convergence checks where applicable, and the distinction between problems with and without analytical solutions. revision: yes

  2. Referee: [Corpus construction] Corpus construction: the manuscript states that 1000+ scripts were obtained by combining expert codes with a multi-LLM retrieval-augmented pipeline, yet supplies no numbers on how many candidates were generated, how many were discarded by the filtering step, per-PDE-category error rates, or the exact verification procedure used to label a script “verified.” Without these statistics the quality and diversity of the fine-tuning data cannot be assessed, undermining claims about the robustness of the resulting models.

    Authors: The referee is correct that these quantitative details are missing. The manuscript currently gives only aggregate figures. In the revision we will expand the Corpus Construction section with: (i) the total number of candidate scripts generated by the retrieval-augmented multi-LLM pipeline, (ii) the number and percentage discarded at each filtering stage together with the primary rejection reasons, (iii) per-PDE-category verification success rates, and (iv) a step-by-step description of the verification procedure, which combines automated syntax/execution tests with expert manual review for variational correctness and physical fidelity. revision: yes

  3. Referee: [Benchmarks] Benchmarks: the 39 problems are listed by category (elasticity, plasticity, flows, FSI, etc.) but no table or appendix enumerates the individual problems, their governing equations, mesh types, or whether analytical solutions exist for quantitative validation. Per-benchmark success rates are also absent, making it impossible to determine whether the aggregate 71.79% figure is driven by a few easy cases or holds across the full range of nonlinear and multiphysics problems.

    Authors: We accept that the benchmark documentation is insufficiently granular. We will add a new appendix (Appendix B) that provides a table enumerating all 39 benchmarks, including the governing PDEs, mesh types, boundary conditions, and the availability of analytical or reference solutions. In addition, the Evaluation section will be augmented with a table of per-benchmark success rates for the GPT OSS 120B model (and the non-agentic baseline) so that readers can assess performance variation across linear/nonlinear, single-physics, and multiphysics cases. revision: yes
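The per-stage statistics promised in response 2 fall out naturally if the filter records why each candidate is dropped. A minimal sketch, assuming each candidate is a standalone Python script tagged with a PDE category; the stage and function names are illustrative, the expert manual review is not modelled, and running FEniCS candidates would require an environment with FEniCS installed:

```python
import subprocess
import sys
from collections import Counter


def filter_candidates(candidates, timeout=60.0):
    """Two automated filter stages over (pde_category, code) pairs.

    Stage 1: the script parses as Python. Stage 2: it runs to completion with
    exit code 0 within `timeout` seconds. Returns the surviving pairs, a Counter
    of rejections keyed by (category, stage), and a Counter of candidates per category.
    """
    kept, rejected, totals = [], Counter(), Counter()
    for category, code in candidates:
        totals[category] += 1
        try:
            compile(code, "<candidate>", "exec")
        except SyntaxError:
            rejected[category, "syntax"] += 1
            continue
        try:
            proc = subprocess.run([sys.executable, "-c", code],
                                  capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            rejected[category, "timeout"] += 1
            continue
        if proc.returncode != 0:
            rejected[category, "runtime"] += 1
            continue
        kept.append((category, code))
    return kept, rejected, totals


def acceptance_rates(kept, totals):
    """Percentage of candidates per category that survived every stage."""
    survived = Counter(category for category, _ in kept)
    return {c: 100.0 * survived[c] / n for c, n in totals.items()}
```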

Circularity Check

0 steps flagged

No significant circularity in the ALL-FEM evaluation pipeline

Full rationale

The paper constructs a training corpus via a multi-LLM retrieval-augmented pipeline and fine-tunes models on it, then reports empirical success rates on a separate set of 39 held-out benchmarks using runtime verification. This performance metric is measured directly against external benchmark problems rather than being derived by construction from the corpus-generation or verification loop. No self-definitional reductions, fitted inputs renamed as predictions, load-bearing self-citations, or ansatzes appear in the described chain; the central claim retains independent empirical content.
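The audit takes the held-out status of the 39 benchmarks as given. One simple probe a reader could run, assuming access to both the training corpus and reference scripts for the benchmarks (neither is described on this page), is to flag near-duplicate pairs by token n-gram Jaccard similarity. The tokenizer, n, and threshold below are arbitrary illustrative choices:

```python
import re


def ngrams(code: str, n: int = 5) -> set:
    """Set of n-grams over a crude token stream (identifiers, numbers, other symbols)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+\.?\d*|\S", code)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0


def flag_overlaps(train_scripts, benchmark_scripts, n=5, threshold=0.5):
    """Return (benchmark_idx, train_idx, similarity) for suspiciously similar pairs."""
    train_grams = [ngrams(s, n) for s in train_scripts]
    flagged = []
    for j, bench in enumerate(benchmark_scripts):
        bench_grams = ngrams(bench, n)
        for i, tg in enumerate(train_grams):
            sim = jaccard(bench_grams, tg)
            if sim >= threshold:
                flagged.append((j, i, sim))
    return flagged
```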

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of fine-tuning and agentic orchestration rather than new physical axioms or derived equations; the main unstated premises are that LLMs can internalize variational structures from code examples and that runtime feedback reliably catches numerical errors.

axioms (2)
  • Domain assumption: Fine-tuned LLMs can generate syntactically and semantically correct FEniCS code for a wide range of PDEs when trained on verified examples.
    Invoked in the construction of the training corpus and the reported success rates.
  • Domain assumption: Multi-agent workflows with runtime execution feedback can close the loop from problem statement to verified solution without external human intervention.
    Central to the agentic framework description.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Perception to Autonomous Computational Modeling: A Multi-Agent Approach

    cs.CE · 2026-04 · unverdicted · novelty 5.0

    A multi-agent LLM framework autonomously completes the full computational mechanics pipeline from a photograph to a code-compliant engineering report on a steel L-bracket example.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · cited by 1 Pith paper · 17 internal anchors

  1. [1]

    Dover Publications, Mineola, NY (2012)

    Hughes, T.J.R.: The Finite Element Method: Linear Static and Dynamic Finite Element Analysis. Dover Publications, Mineola, NY (2012)

  2. [2]

    Logg, K.-A

    Logg, A., Mardal, K., Wells, G. (eds.): Automated Solution of Differential Equations by the Finite Element Method: The FEniCS Book. Lecture Notes in Computational Science and Engineering, vol. 84, p. 731. Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-23099-8

  3. [3]

    Butterworth-Heinemann, Oxford (2005)

    Zienkiewicz, O.C., Taylor, R.L.: The Finite Element Method for Solid and Structural Mechanics. Butterworth-Heinemann, Oxford (2005). https://books.google.com/books?id=VvpU3zssDOwC

  4. [4]

    Hughes, J

    Hughes, T.J.R., Cottrell, J.A., Bazilevs, Y.: Isogeometric analysis: Cad, finite elements, nurbs, exact geometry and mesh refinement. Computer Methods in Applied Mechanics and Engineering 194(39), 4135–4195 (2005) https://doi.org/10.1016/j.cma.2004.10.008

  5. [5]

    The FEniCS Project Version 1.5

    Alnæs, M., Blechta, J., Hake, J., Johansson, A., Kehlet, B., Logg, A., Richardson, C., Ring, J., Rognes, M., Wells, G.: The FEniCS project version 1.5. Archive of Numerical Software3(100), 9–23 (2015) https://doi.org/10.11588/ans.2015.100.20553

  6. [6]

    Imperial College London and University of Oxford and Baylor University and University of Washington, (2023)

    Ham, D.A., Kelly, P.H.J., Mitchell, L., Cotter, C.J., Kirby, R.C., Sagiyama, K., Bouziani, N., Vorderwuelbecke, S., Gregory, T.J., Betteridge, J., Shapero, D.R., Nixon-Hill, R.W., Ward, C.J., Farrell, P.E., Brubeck, P.D., Marsden, I., Gibson, T.H., Homolya, M., Sun, T., McRae, A.T.T., Luporini, F., Gregory, A., Lange, M., Funke, S.W., Rathgeber, F., Berce...

  7. [7]

    Computer Methods in Applied Mechanics and Engineering 450, 118591 (2026)

    Guo, J., Park, C., Qian, D., Hughes, T.J., Liu, W.K.: Large language model-empowered next- generation computer-aided engineering. Computer Methods in Applied Mechanics and Engineering 450, 118591 (2026)

  8. [8]

    MechAgents: Large lan- guage model multi-agent collaborations can solve mechan- ics problems, generate new data, and integrate knowledge

    Ni, B., Buehler, M.J.: MechAgents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge (2023). https://arxiv.org/abs/ 2311.08166

  9. [9]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Xu, C., Chen, Y., Wang, L., Luu, A.T., Bi, W., Shi, F., Shi, S.: Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models (2025). https://arxiv.org/abs/2309.01219

  10. [10]

    A survey on evaluation of large language models

    Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., Ye, W., Zhang, Y., Chang, Y., Yu, P.S., Yang, Q., Xie, X.: A Survey on Evaluation of Large Language Models (2023). https://arxiv.org/abs/2307.03109

  11. [11]

    CodeMirage : Hallucinations in Code Generated by Large Language Models , 2024

    Agarwal, V., Pei, Y., Alamir, S., Liu, X.: CodeMirage: Hallucinations in Code Generated by Large Language Models (2025). https://arxiv.org/abs/2408.08333

  12. [12]

    More agents is all you need

    Li, J., Zhang, Q., Yu, Y., Fu, Q., Ye, D.: More Agents Is All You Need (2024). https://arxiv.org/ abs/2402.05120

  13. [13]

    https://arxiv.org/abs/2408.13406

    Tian, C., Zhang, Y.: Optimizing Collaboration of LLM based Agents for Finite Element Analysis (2024). https://arxiv.org/abs/2408.13406

  14. [14]

    Digital Discovery3(7), 1389–1409 (2024) https://doi.org/10.1039/d4dd00013g

    Ghafarollahi, A., Buehler, M.J.: Protagents: protein discovery via large language model multi-agent collaborations combining physics and machine learning. Digital Discovery3(7), 1389–1409 (2024) https://doi.org/10.1039/d4dd00013g

  15. [15]

    Honeycomb: A flexible llm-based agent system for materials science

    Zhang, H., Song, Y., Hou, Z., Miret, S., Liu, B.: HoneyComb: A Flexible LLM-Based Agent System for Materials Science (2024). https://arxiv.org/abs/2409.00135 44

  16. [16]

    In: Advances in Neural Information Processing Systems (NeurIPS 2020) (2020)

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K¨ uttler, H., Lewis, M., Yih, W.-t., Rockt¨ aschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge- intensive nlp tasks. In: Advances in Neural Information Processing Systems (NeurIPS 2020) (2020)

  17. [17]

    Physics of Fluids37(3), 035120 (2025) https://doi.org/10.1063/5.0257555

    Pandey, S., Xu, R., Wang, W., Chu, X.: Openfoamgpt: A retrieval-augmented large language model (llm) agent for openfoam-based computational fluid dynamics. Physics of Fluids37(3), 035120 (2025) https://doi.org/10.1063/5.0257555

  18. [18]

    Theoretical and Applied Mechanics Letters, 100623 (2025) https://doi.org/10.1016/ j.taml.2025.100623

    Wang, W., Xu, R., Feng, J., Zhang, Q., Pandey, S., Chu, X.: A status quo investigation of large-language models for cost-effective computational fluid dynamics automation with Open- FOAMGPT. Theoretical and Applied Mechanics Letters, 100623 (2025) https://doi.org/10.1016/ j.taml.2025.100623

  19. [19]

    URL https://arxiv.org/abs/2504.19338

    Feng, J., Xu, R., Chu, X.: Openfoamgpt 2.0: end-to-end, trustworthy automation for computational fluid dynamics. arXiv preprint arXiv:2504.19338 (2025)

  20. [20]

    arXiv preprint arXiv:2509.18178 (2025)

    Yue, L., Somasekharan, N., Zhang, T., Cao, Y., Pan, S.: Foam-agent: An end-to-end composable multi-agent framework for automating cfd simulation in openfoam. arXiv preprint arXiv:2509.18178 (2025)

  21. [21]

    Theoretical and Applied Mechanics Letters15, 100594 (2025) https: //doi.org/10.1016/j.taml.2025.100594

    Dong, Z., Lu, Z., Yang, Y.: Fine-tuning a large language model for automating computational fluid dynamics simulations. Theoretical and Applied Mechanics Letters15, 100594 (2025) https: //doi.org/10.1016/j.taml.2025.100594

  22. [22]

    Metaopenfoam: an llm-based multi-agent framework for cfd

    Chen, Y., Zhu, X., Zhou, H., Ren, Z.: Metaopenfoam: an llm-based multi-agent framework for cfd. arXiv preprint arXiv:2407.21320 (2024)

  23. [23]

    arXiv preprint arXiv:2506.02019 (2025)

    Fan, E., Hu, K., Wu, Z., Ge, J., Miao, J., Zhang, Y., Sun, H., Wang, W., Zhang, T.: Chatcfd: An llm-driven agent for end-to-end cfd automation with domain-specific structured reasoning. arXiv preprint arXiv:2506.02019 (2025)

  24. [24]

    In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

    Hou, S., Johnson, R., Makhija, R., Chen, L., Ye, Y.: Autofea: Enhancing ai copilot by integrating finite element analysis using large language models with graph neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 24078–24087 (2025)

  25. [25]

    FeaGPT: an End-to-End agentic-AI for Finite Element Analysis

    Qi, Y., Xu, R., Chu, X.: Feagpt: an end-to-end agentic-ai for finite element analysis. arXiv preprint arXiv:2510.21993 (2025)

  26. [26]

    Masset, R

    Feng, J., Qi, Y., Xu, R., Pandey, S., Chu, X.: turbulence.ai: an end-to-end ai scientist for fluid mechanics. Theoretical and Applied Mechanics Letters, 100620 (2025) https://doi.org/10.1016/j. taml.2025.100620

  27. [27]

    Jiang, G

    Jiang, Q., Karniadakis, G.: Agenticsciml: Collaborative multi-agent systems for emergent discovery in scientific machine learning. arXiv preprint arXiv:2511.07262 (2025)

  28. [28]

    ATHENA: Agentic Team for Hierarchical Evolutionary Numerical Algorithms

    Toscano, J.D., Chen, D.T., Karniadakis, G.E.: Athena: Agentic team for hierarchical evolutionary numerical algorithms. arXiv preprint arXiv:2512.03476 (2025)

  29. [29]

    SIAM Review57(4), 483–531 (2015) https://doi.org/10.1137/ 130932715

    Benner, P., Gugercin, S., Willcox, K.: A survey of projection-based model reduction methods for parametric dynamical systems. SIAM Review57(4), 483–531 (2015) https://doi.org/10.1137/ 130932715

  30. [30]

    Accessed: 2026-02-22 (2025)

    Guo, J., Domel, G., Park, C., Zhang, H., Gumus, O.C., Lu, Y., Wagner, G.J., Qian, D., Cao, J., Hughes, T.J.R., Liu, W.K.: Tensor-decomposition-based A Priori Surrogate (TAPS) modeling for ultra large-scale simulations. Accessed: 2026-02-22 (2025). https://arxiv.org/abs/2503.13933

  31. [31]

    PaLM 2 Technical Report

    Anil, R., Dai, A.M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., Chu, E., Clark, J.H., Shafey, L.E., Huang, Y., Meier-Hellstern, K., Mishra, G., Moreira, E., Omernick, M., Robinson, K., Ruder, S., Tay, Y., Xiao, K., Xu, Y., Zhang, Y., Abrego, G.H., Ahn, J., Austin, J., Barham, P., Botha, J., Bradbury, J...

  32. [32]

    Attention Is All You Need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention Is All You Need. arXiv (2017). https://doi.org/10.48550/ARXIV.1706.03762 . https: //arxiv.org/abs/1706.03762

  33. [33]

    https://www.anthropic.com/news/ 100k-context-windows (2023)

    Anthropic: Introducing 100K Context Windows. https://www.anthropic.com/news/ 100k-context-windows (2023)

  34. [34]

    Evaluating Large Language Models Trained on Code

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Oliveira Pinto, H.P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F.P., Cummings, D., Plappert, M., Chantzi...

  35. [35]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D.: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2023). https://arxiv. org/abs/2201.11903

  36. [36]

    In: The Eleventh International Conference on Learning Representations (2023)

    Wang, X., Wei, J., Schuurmans, D., Le, Q.V., Chi, E.H., Narang, S., Chowdhery, A., Zhou, D.: Self- consistency improves chain of thought reasoning in language models. In: The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=1PL1NIMMrw

  37. [37]

    Meta (2024)

    AI, M.: Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models. Meta (2024). https://huggingface.co/meta-llama/Llama-3.2-3B

  38. [38]

    https://huggingface.co/Qwen/Qwen3-32B

    Qwen Team: Qwen3-32B. https://huggingface.co/Qwen/Qwen3-32B. Model card. Accessed: 18 November 2025 (2025)

  39. [39]

    https://huggingface.co/meta-llama/Llama-3

    Meta (via Hugging Face): Llama-3.3-70B-Instruct. https://huggingface.co/meta-llama/Llama-3. 3-70B-Instruct (2024)

  40. [41]

    Technical report, OpenAI (aug 2025)

    OpenAI: Gpt-5 system card. Technical report, OpenAI (aug 2025). Accessed: 2025-11-20. https: //cdn.openai.com/gpt-5-system-card.pdf

  41. [42]

    Accessed: 2025-11-20 (2024)

    AI at Meta: Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. Accessed: 2025-11-20 (2024). https://ai.meta.com/blog/ llama-3-2-connect-2024-vision-edge-mobile-devices/

  42. [43]

    https://aider.chat/docs/leaderboards/

    Aider: Aider LLM Leaderboards. https://aider.chat/docs/leaderboards/. Accessed: 18 November 2025 (2025)

  43. [44]

    The Llama 3 Herd of Models

    Llama Team, AI@Meta: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024) 46

  44. [45]

    https://www.vellum.ai/blog/llama-3-3-70b-vs-gpt-4o

    Vellum AI: Llama 3.3 70B vs GPT-4o. https://www.vellum.ai/blog/llama-3-3-70b-vs-gpt-4o. Eval- uation shows GPT-4o leads on math (55% vs lower), reasoning (69% vs 44%), while Llama 3.3 70B is competitive on classification and strengths in coding, tool use and multilingual tasks; cost/latency analysis included. (2024)

  45. [46]

    https://openai.com/index/introducing-gpt-oss/

    OpenAI: Introducing gpt-oss. https://openai.com/index/introducing-gpt-oss/. Accessed: 2025-11- 19 (2025)

  46. [47]

    https://github.com/ openai/gpt-oss

    OpenAI: gpt-oss: gpt-oss-120b and gpt-oss-20b open-weight language models. https://github.com/ openai/gpt-oss. GitHub repository; accessed: 2025-11-19 (2025)

  47. [48]

    Measuring Massive Multitask Language Understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring Massive Multitask Language Understanding (2021). https://arxiv.org/abs/2009.03300

  48. [49]

    https://arxiv.org/abs/2311

    Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J., Bowman, S.R.: GPQA: A Graduate-Level Google-Proof Q and A Benchmark (2023). https://arxiv.org/abs/2311. 12022

  49. [50]

    https://arxiv.org/abs/2410.03131

    Patel, B., Chakraborty, S., Suttle, W.A., Wang, M., Bedi, A.S., Manocha, D.: AIME: AI System Optimization via Multiple LLM Evaluators (2024). https://arxiv.org/abs/2410.03131

  50. [51]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI: gpt-oss-120b & gpt-oss-20b Model Card. Original PDF available from OpenAI at https:// cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai gpt-oss model card.pdf (2025). https://arxiv.org/abs/2508.10925

  51. [52]

    https://huggingface.co/docs/transformers/ en/quantization/mxfp4

    Hugging Face: MXFP4 Quantization in Transformers. https://huggingface.co/docs/transformers/ en/quantization/mxfp4. Accessed: 2025-11-24 (2025)

  52. [53]

    OpenAI: Introducing GPT-5. OpenAI. https://openai.com/index/introducing-gpt-5/ Accessed 2025-11-26

  53. [54]

    Journal of Chemical Information and Modeling62(24), 6365–6377 (2022) https://doi.org/ 10.1021/acs.jcim.2c00035

    Huang, S., Cole, J.M.: Batterybert: A pretrained language model for battery database enhance- ment. Journal of Chemical Information and Modeling62(24), 6365–6377 (2022) https://doi.org/ 10.1021/acs.jcim.2c00035

  54. [55]

    Bioinformatics , volume =

    Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics36(4), 1234–1240 (2019) https://doi.org/10.1093/bioinformatics/btz682

  55. [56]

    doi: 10.1038/s41587-022-01618-2

    Madani, A., Krause, B., Greene, E.R., Subramanian, S., Mohr, B.P., Holton, J.M., Olmos, J.L., Xiong, C., Sun, Z.Z., Socher, R., Fraser, J.S., Naik, N.: Large language models generate functional protein sequences across diverse families. Nature Biotechnology41(8), 1099–1106 (2023) https: //doi.org/10.1038/s41587-022-01618-2

  56. [57]

    Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    Han, Z., Gao, C., Liu, J., Zhang, J., Zhang, S.Q.: Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey (2024). https://arxiv.org/abs/2403.14608

  57. [58]

    Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., Hashimoto, T.B.: Stanford Alpaca: An Instruction-following LLaMA model. GitHub. Retrieved from https://github.com/tatsu-lab/stanford_alpaca (2023)

  58. [59]

    Langtangen, H.P., Logg, A.: Solving PDEs in Python: The FEniCS Tutorial I (Volume 3, Simula SpringerBriefs on Computing), p. 146. Springer, Cham, Switzerland (2017). https://doi.org/10.1007/978-3-319-52462-7. https://fenicsproject.org/tutorial/

  59. [60]

    Kamensky, D.: MAE 207 – FEA for coupled problems: Code examples. https://github.com/david-kamensky/mae-207-fea-for-coupled-problems. Code examples for the class "MAE 207: FEA for coupled problems" at UC San Diego (Legacy FEniCS) (2022)

  60. [61]

    The FEniCS Project: DOLFIN Python demos (legacy FEniCS documentation). https://olddocs.fenicsproject.org/dolfin/latest/python/demos.html. Accessed 2025-11-19 (2017)

  61. [62]

    Burkardt, J.: FEniCS Examples. https://people.sc.fsu.edu/~jburkardt/fenics_src/fenics_src.html. Mirror at https://people.math.sc.edu/Burkardt/fenics_src/fenics_src.html. Accessed 2025-11-19 (2020)

  62. [63]

    OpenAI: Introducing OpenAI o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/. Accessed 2025-12-02

  63. [64]

    Google: Colaboratory (Google Colab). https://research.google.com/colaboratory/faq.html. Accessed 2025-11-19

  64. [65]

    Google DeepMind: Gemini 2.5: Our Most Intelligent AI Model. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/

  65. [67]

    Hu, E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-Rank Adaptation of Large Language Models. In: Proceedings of the 10th International Conference on Learning Representations (ICLR 2022) (2022). Paper and code available at https://arxiv.org/abs/2106.09685. https://openreview.net/forum?id=nZeVKeeFYf9

  66. [68]

    Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient Finetuning of Quantized LLMs (2023). https://arxiv.org/abs/2305.14314

  67. [69]

    Shuttleworth, R., Andreas, J., Torralba, A., Sharma, P.: LoRA vs Full Fine-tuning: An Illusion of Equivalence (2024). https://arxiv.org/abs/2410.21228

  68. [70]

    Kalajdzievski, D.: A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA (2023). https://arxiv.org/abs/2312.03732

  69. [71]

    Biderman, D., Portes, J., Ortiz, J.J.G., Paul, M., Greengard, P., Jennings, C., King, D., Havens, S., Chiley, V., Frankle, J., Blakeney, C., Cunningham, J.P.: LoRA learns less and forgets less. Transactions on Machine Learning Research (2024). Featured Certification

  70. [72]

    Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization (2019). https://arxiv.org/abs/1711.05101

  71. [73]

    Rosen Center for Advanced Computing: AnvilGPT User Guide. https://www.rcac.purdue.edu/knowledge/anvil/anvilgpt. Purdue University. Accessed: 2025-11-22 (2025)

  72. [74]

    Ollama: Ollama. https://ollama.com/. Accessed: 2025-11-22 (2025)

  73. [75]

    Rosen Center for Advanced Computing: AnvilGPT API. https://www.rcac.purdue.edu/knowledge/anvil/anvilgpt/api. Purdue University. Accessed: 2025-11-22 (2025)

  74. [76]

    Hugging Face: PEFT LoRA Developer Guide: Merge LoRA weights into the base model. https://huggingface.co/docs/peft/developer_guides/lora. Accessed 2025-11-22 (2024)

  75. [77]

    Gerganov, G., and contributors: llama.cpp: LLM inference in C/C++. https://github.com/ggml-org/llama.cpp. Accessed 2025-11-22 (2023)

  76. [78]

    llama.cpp contributors: How to convert Hugging Face models to GGUF. https://github.com/ggml-org/llama.cpp/discussions/2948. Accessed 2025-11-22 (2024)

  77. [79]

    Hugging Face: GGUF on the Hugging Face Hub. https://huggingface.co/docs/hub/gguf. Accessed 2025-11-22 (2024)

  78. [80]

    Ollama: Importing a GGUF Model or Adapter. https://docs.ollama.com/import. Accessed 2025-11-22 (2025)

  79. [81]

    Artefact2: GGUF quantizations overview and recommendations. https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9. Accessed 2025-11-22 (2024)

  80. [82]

    Ollama: Modelfile Reference. https://docs.ollama.com/modelfile. Accessed 2025-11-22 (2025)