Recognition: no theorem link
Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software
Pith reviewed 2026-05-10 19:14 UTC · model grok-4.3
The pith
LLMs match traditional methods at stabilizing numerical expressions and fix 98 percent of the cases those methods miss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models match state-of-the-art traditional approaches at detecting and stabilizing numerically unstable computations. Among the 17.4 percent (431) of expressions where the baseline fails to improve accuracy, LLMs successfully stabilize 422 (97.9 percent), and they achieve greater stability than the baseline on 65.4 percent (1,615) of all expressions. LLMs struggle with control flow and high-precision literals, consistently removing such structures rather than reasoning about their numerical implications, whereas they perform substantially better on purely symbolic expressions.
What carries the argument
Two evaluation tasks—generating error-inducing inputs for instability detection and rewriting expressions for stabilization—applied to six LLMs across 2,470 benchmark structures.
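As a concrete illustration of the stabilization task — a classic cancellation example, not one drawn from the paper's benchmarks — the expression sqrt(x+1) - sqrt(x) loses digits for large x and can be rewritten into an algebraically equivalent form that avoids the subtraction:

```python
import math
from decimal import Decimal, getcontext

# Naive form: catastrophic cancellation for large x.
def naive(x):
    return math.sqrt(x + 1) - math.sqrt(x)

# Algebraically equivalent rewrite that avoids the subtraction.
def stable(x):
    return 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

# High-precision reference via the decimal module (50 digits).
getcontext().prec = 50
def reference(x):
    d = Decimal(x)
    return (d + 1).sqrt() - d.sqrt()

x = 1e12
ref = float(reference(x))
err_naive = abs(naive(x) - ref) / ref
err_stable = abs(stable(x) - ref) / ref
assert err_stable < err_naive  # the rewrite recovers the lost digits
```

Rewrites of this shape are what the stabilization task asks the models to produce.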
If this is right
- LLMs can complement classical numerical-analysis tools by handling expressions that resist traditional rewriting.
- Scientific software developers could apply LLMs selectively to the subset of code where baselines fail.
- The gap between symbolic and control-flow performance indicates that targeted fine-tuning on conditional numerical patterns could extend LLM usefulness.
- Accuracy improvements measured on benchmarks may translate to fewer floating-point errors in safety-critical simulations.
- Purely algebraic subexpressions are the current sweet spot for LLM-based stabilization.
Where Pith is reading between the lines
- Pairing LLMs with static analysis tools that preserve control flow could address the models' tendency to simplify away conditional logic.
- The strong results on symbolic cases suggest LLMs might already serve as drop-in assistants inside computer-algebra systems for floating-point rewriting.
- If training corpora included more examples of high-precision literals inside conditionals, the observed removal behavior might decrease.
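The pairing idea in the first bullet could be sketched — purely hypothetically; this is not something the paper implements — as a structural guard that rejects any rewrite dropping control-flow constructs, for example by comparing AST node counts:

```python
import ast

# Count control-flow constructs; a rewrite that silently drops a
# branch changes this signature and would be rejected by the guard.
def control_flow_signature(src):
    tree = ast.parse(src)
    kinds = (ast.If, ast.IfExp, ast.While, ast.Compare)
    return sum(isinstance(node, kinds) for node in ast.walk(tree))

original    = "y = x*x/(1 - x) if x < 0.5 else math.log(x)"
rewrite_ok  = "y = (x*x)/(1 - x) if x < 0.5 else math.log(x)"
rewrite_bad = "y = x*x/(1 - x)"  # branch silently removed

assert control_flow_signature(rewrite_ok) == control_flow_signature(original)
assert control_flow_signature(rewrite_bad) < control_flow_signature(original)
```

A guard like this would catch the removal behavior the paper reports, without requiring the LLM itself to reason about the branch.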
Load-bearing premise
The chosen numerical benchmarks and 2,470 structures represent the expressions and stability challenges that appear in actual scientific software, and accuracy gains from rewriting fully capture practical stability benefits without side effects on semantics or performance.
What would settle it
Running the same detection and stabilization pipeline on a fresh set of expressions extracted directly from production scientific libraries and observing that LLMs stabilize fewer than 80 percent of the cases the baseline misses or introduce semantic changes that alter computed results.
Original abstract
Scientific software relies on high-precision computation, yet finite floating-point representations can introduce precision errors that propagate in safety-critical domains. Despite the growing use of large language models (LLMs) in scientific applications, their reliability in handling floating-point numerical stability has not been systematically evaluated. This paper evaluates LLMs' reasoning on high-precision numerical computation through two numerical stabilization tasks: (1) detecting instability in numerical expressions by generating error-inducing inputs (detection), and (2) rewriting expressions to improve numerical stability (stabilization). Using popular numerical benchmarks, we assess six LLMs on nearly 2,470 numerical structures, including nested conditionals, high-precision literals, and multi-variable arithmetic. Our results show that LLMs are equally effective as state-of-the-art traditional approaches in detecting and stabilizing numerically unstable computations. More notably, LLMs outperform baseline methods precisely where the latter fail: in 17.4% (431) of expressions where the baseline does not improve accuracy, LLMs successfully stabilize 422 (97.9%) of them, and achieve greater stability than the baseline across 65.4% (1,615) of all expressions. However, LLMs struggle with control flow and high-precision literals, consistently removing such structures rather than reasoning about their numerical implications, whereas they perform substantially better on purely symbolic expressions. Together, these findings suggest that LLMs are effective at stabilizing expressions that classical techniques cannot, yet struggle when exact numerical magnitudes and control flow semantics must be precisely reasoned about, as such concrete patterns are rarely encountered during training.
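The detection task described in the abstract — generating error-inducing inputs — can be illustrated (this sketch is not the paper's pipeline) as a random search that compares double-precision evaluation of log(1+x)/x against a 50-digit decimal reference:

```python
import math
import random
from decimal import Decimal, getcontext

getcontext().prec = 50

# Expression under test: log(1 + x) / x, which loses digits as x -> 0
# because 1 + x absorbs most of a tiny x before the log is taken.
def f64(x):
    return math.log(1.0 + x) / x

def reference(x):
    d = Decimal(x)          # exact conversion from float
    return (1 + d).ln() / d

# Sample magnitudes across many decades; keep the worst offender.
random.seed(0)
worst_x, worst_err = None, 0.0
for _ in range(2000):
    x = random.uniform(1.0, 10.0) * 10.0 ** random.randint(-15, 0)
    ref = float(reference(x))
    err = abs(f64(x) - ref) / abs(ref)
    if err > worst_err:
        worst_x, worst_err = x, err
# The worst inputs cluster in the cancellation region near x ~ 1e-15.
```

The paper's baselines find such inputs with specialized techniques; the LLM detection task asks the model to propose them directly.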
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates six LLMs on detecting numerical instability (via error-inducing input generation) and stabilizing expressions (via rewriting) across nearly 2,470 numerical structures drawn from popular benchmarks. These structures include nested conditionals, high-precision literals, and multi-variable arithmetic. The central claims are that LLMs match state-of-the-art traditional methods in both tasks overall, yet substantially outperform baselines on the 431 expressions (17.4%) where baselines fail to improve accuracy—stabilizing 422 (97.9%) of them—and achieve greater stability than baselines on 65.4% (1,615) of all expressions. The work also reports that LLMs tend to remove control-flow and high-precision literal structures rather than reason about their numerical implications.
Significance. If the central performance claims hold under scrutiny for semantic equivalence, the results would indicate that LLMs can serve as a practical complement to classical numerical stabilization techniques in scientific software, especially in the subset of cases where traditional methods are ineffective. The identification of specific failure modes with control flow and literals would also provide actionable guidance for future LLM-based tools in high-precision computing domains.
Major comments (2)
- [Abstract] The claim that LLMs 'successfully stabilize 422 (97.9%)' of the 431 expressions where baselines fail, and achieve greater stability across 65.4% of all expressions, requires that LLM rewrites preserve the original mathematical function. Yet the abstract itself notes that LLMs consistently remove nested conditionals and high-precision literals rather than reasoning about their numerical implications. This creates a risk that reported accuracy gains are measured on semantically altered (simpler) expressions rather than the original unstable computations, directly threatening the 'outperform where baselines fail' result.
- [Experimental setup and results sections] The manuscript reports concrete counts and percentages from 2,470 expressions but provides insufficient detail on baseline implementations, the exact definition and computation of the accuracy-improvement/stability metric, any statistical tests for the reported differences, and the prompt templates or engineering used for the six LLMs. These omissions are load-bearing for assessing reproducibility and potential confounds in the performance comparisons.
Minor comments (1)
- [Abstract] The phrasing 'popular numerical benchmarks' and 'nearly 2,470 numerical structures' would benefit from explicit citation of the specific benchmark suites and a precise count or breakdown by category (e.g., how many structures contain nested conditionals).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and note the revisions we have made to improve the paper.
Point-by-point responses
Referee: [Abstract] The claim that LLMs 'successfully stabilize 422 (97.9%)' of the 431 expressions where baselines fail, and achieve greater stability across 65.4% of all expressions, requires that LLM rewrites preserve the original mathematical function. Yet the abstract itself notes that LLMs consistently remove nested conditionals and high-precision literals rather than reasoning about their numerical implications. This creates a risk that reported accuracy gains are measured on semantically altered (simpler) expressions rather than the original unstable computations, directly threatening the 'outperform where baselines fail' result.
Authors: We appreciate the referee's emphasis on semantic equivalence, which is central to our claims. In the revised manuscript, we have added a dedicated subsection in the results explaining our equivalence verification procedure: for every LLM rewrite, we evaluate both the original and rewritten expressions on a held-out set of 500 validation inputs sampled from the same domain as the error-inducing inputs. We accept a rewrite only if the outputs agree to within a relative tolerance of 1e-9. Our analysis of the 422 successful stabilizations shows that removals of control-flow or literals occurred primarily in cases where those structures were either redundant for the input domain or where the removal yielded an algebraically equivalent expression (e.g., constant-folded conditionals). We include a new appendix with representative examples of such cases and a breakdown showing that 78% of the 422 stabilizations involved no structural removal. We also expanded the limitations section to discuss the remaining cases where LLMs simplify without full numerical reasoning. These additions directly address the concern while preserving the reported performance numbers. revision: partial
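A minimal sketch of the equivalence check the response describes — 500 validation inputs, relative tolerance 1e-9 — with illustrative expressions (the actual benchmark expressions are not shown here):

```python
import math
import random

REL_TOL = 1e-9  # acceptance tolerance quoted in the response

def agrees(original, rewrite, inputs, rel_tol=REL_TOL):
    """Accept a rewrite only if it matches the original on every
    validation input to within the relative tolerance."""
    return all(math.isclose(original(x), rewrite(x), rel_tol=rel_tol)
               for x in inputs)

# Toy expressions: the reciprocal rewrite is equivalent, while a
# rewrite that drops part of the expression is not.
original = lambda x: math.sqrt(x + 1) - math.sqrt(x)
rewrite  = lambda x: 1.0 / (math.sqrt(x + 1) + math.sqrt(x))
broken   = lambda x: math.sqrt(x + 1)   # semantics changed

random.seed(1)
inputs = [random.uniform(1.0, 1e4) for _ in range(500)]
assert agrees(original, rewrite, inputs)
assert not agrees(original, broken, inputs)
```

A check of this shape would flag the structural-removal cases the referee worries about, since a dropped branch generally changes outputs on some validation inputs.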
Referee: [Experimental setup and results sections] The manuscript reports concrete counts and percentages from 2,470 expressions but provides insufficient detail on baseline implementations, the exact definition and computation of the accuracy-improvement/stability metric, any statistical tests for the reported differences, and the prompt templates or engineering used for the six LLMs. These omissions are load-bearing for assessing reproducibility and potential confounds in the performance comparisons.
Authors: We agree that the original manuscript lacked sufficient implementation detail for full reproducibility. In the revised version we have expanded the Experimental Setup section (now approximately 40% longer) to include: (1) precise descriptions and source references for all baseline algorithms, (2) the exact formula for the stability metric (relative reduction in maximum absolute error across the generated error-inducing inputs, with the formula and pseudocode provided), (3) results of paired statistical tests (Wilcoxon signed-rank tests with reported p-values and effect sizes for all LLM-vs-baseline comparisons), and (4) the complete prompt templates for each of the six LLMs together with the prompt-engineering steps used. We have also added a reproducibility statement with links to the full prompt set and evaluation code. These changes directly resolve the reproducibility concerns raised. revision: yes
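The stability metric described in item (2) — relative reduction in maximum absolute error over the error-inducing inputs — can be written out directly; the expression and inputs below are illustrative, not the paper's:

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50

def max_abs_error(expr, reference, inputs):
    # Worst absolute error of a double-precision implementation
    # against a high-precision reference.
    return max(abs(expr(x) - float(reference(x))) for x in inputs)

def stability_gain(original, rewrite, reference, inputs):
    # Relative reduction in maximum absolute error; 1.0 means the
    # rewrite eliminated the worst-case error entirely.
    before = max_abs_error(original, reference, inputs)
    after = max_abs_error(rewrite, reference, inputs)
    return (before - after) / before

original = lambda x: math.sqrt(x + 1) - math.sqrt(x)
rewrite  = lambda x: 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

def reference(x):
    d = Decimal(x)
    return (d + 1).sqrt() - d.sqrt()

inputs = [10.0 ** k for k in range(6, 13)]  # error-inducing region
gain = stability_gain(original, rewrite, reference, inputs)
assert 0.0 < gain <= 1.0  # the rewrite strictly reduces the worst error
```

Paired comparisons of such per-expression gains are what the reported Wilcoxon signed-rank tests would operate on.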
Circularity Check
No circularity: empirical benchmarking with external baselines
full rationale
The paper is a direct empirical study that evaluates LLMs on detection and stabilization tasks using popular numerical benchmarks and 2,470 structures. It compares LLM outputs against state-of-the-art traditional baselines via accuracy-improvement metrics on held-out expressions. No derivations, equations, fitted parameters, or self-citations are invoked as load-bearing premises. All reported results (e.g., 97.9% success on 431 cases) are obtained by explicit measurement against independent baselines and benchmarks, with no reduction of claims to the paper's own inputs by construction.