Recognition: no theorem link
Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software
Pith reviewed 2026-05-10 19:14 UTC · model grok-4.3
The pith
LLMs match traditional methods at stabilizing numerical expressions and fix 98 percent of the cases those methods miss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models match state-of-the-art traditional approaches at detecting and stabilizing numerically unstable computations. Among the 17.4 percent (431) of expressions where the baseline fails to improve accuracy, LLMs successfully stabilize 422 (97.9 percent), and they achieve greater stability than the baseline on 65.4 percent (1,615) of all expressions. LLMs struggle with control flow and high-precision literals, consistently removing such structures rather than reasoning about their numerical implications, whereas they perform substantially better on purely symbolic expressions.
What carries the argument
Two evaluation tasks—generating error-inducing inputs for instability detection and rewriting expressions for stabilization—applied to six LLMs across 2,470 benchmark structures.
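As a concrete illustration of the stabilization task — a classic cancellation example, not one drawn from the paper's benchmarks — the expression sqrt(x+1) - sqrt(x) loses digits for large x and can be rewritten into an algebraically equivalent form that avoids the subtraction:

```python
import math
from decimal import Decimal, getcontext

# Naive form: catastrophic cancellation for large x.
def naive(x):
    return math.sqrt(x + 1) - math.sqrt(x)

# Algebraically equivalent rewrite that avoids the subtraction.
def stable(x):
    return 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

# High-precision reference via the decimal module (50 digits).
getcontext().prec = 50
def reference(x):
    d = Decimal(x)
    return (d + 1).sqrt() - d.sqrt()

x = 1e12
ref = float(reference(x))
err_naive = abs(naive(x) - ref) / ref
err_stable = abs(stable(x) - ref) / ref
assert err_stable < err_naive  # the rewrite recovers the lost digits
```

Rewrites of this shape are what the stabilization task asks the models to produce.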
If this is right
- LLMs can complement classical numerical-analysis tools by handling expressions that resist traditional rewriting.
- Scientific software developers could apply LLMs selectively to the subset of code where baselines fail.
- The gap between symbolic and control-flow performance indicates that targeted fine-tuning on conditional numerical patterns could extend LLM usefulness.
- Accuracy improvements measured on benchmarks may translate to fewer floating-point errors in safety-critical simulations.
- Purely algebraic subexpressions are the current sweet spot for LLM-based stabilization.
Where Pith is reading between the lines
- Pairing LLMs with static analysis tools that preserve control flow could address the models' tendency to simplify away conditional logic.
- The strong results on symbolic cases suggest LLMs might already serve as drop-in assistants inside computer-algebra systems for floating-point rewriting.
- If training corpora included more examples of high-precision literals inside conditionals, the observed removal behavior might decrease.
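The pairing idea in the first bullet could be sketched — purely hypothetically; this is not something the paper implements — as a structural guard that rejects any rewrite dropping control-flow constructs, for example by comparing AST node counts:

```python
import ast

# Count control-flow constructs; a rewrite that silently drops a
# branch changes this signature and would be rejected by the guard.
def control_flow_signature(src):
    tree = ast.parse(src)
    kinds = (ast.If, ast.IfExp, ast.While, ast.Compare)
    return sum(isinstance(node, kinds) for node in ast.walk(tree))

original    = "y = x*x/(1 - x) if x < 0.5 else math.log(x)"
rewrite_ok  = "y = (x*x)/(1 - x) if x < 0.5 else math.log(x)"
rewrite_bad = "y = x*x/(1 - x)"  # branch silently removed

assert control_flow_signature(rewrite_ok) == control_flow_signature(original)
assert control_flow_signature(rewrite_bad) < control_flow_signature(original)
```

A guard like this would catch the removal behavior the paper reports, without requiring the LLM itself to reason about the branch.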
Load-bearing premise
The chosen numerical benchmarks and 2,470 structures represent the expressions and stability challenges that appear in actual scientific software, and accuracy gains from rewriting fully capture practical stability benefits without side effects on semantics or performance.
What would settle it
Running the same detection and stabilization pipeline on a fresh set of expressions extracted directly from production scientific libraries and observing that LLMs stabilize fewer than 80 percent of the cases the baseline misses or introduce semantic changes that alter computed results.
Original abstract
Scientific software relies on high-precision computation, yet finite floating-point representations can introduce precision errors that propagate in safety-critical domains. Despite the growing use of large language models (LLMs) in scientific applications, their reliability in handling floating-point numerical stability has not been systematically evaluated. This paper evaluates LLMs' reasoning on high-precision numerical computation through two numerical stabilization tasks: (1) detecting instability in numerical expressions by generating error-inducing inputs (detection), and (2) rewriting expressions to improve numerical stability (stabilization). Using popular numerical benchmarks, we assess six LLMs on nearly 2,470 numerical structures, including nested conditionals, high-precision literals, and multi-variable arithmetic. Our results show that LLMs are equally effective as state-of-the-art traditional approaches in detecting and stabilizing numerically unstable computations. More notably, LLMs outperform baseline methods precisely where the latter fail: in 17.4% (431) of expressions where the baseline does not improve accuracy, LLMs successfully stabilize 422 (97.9%) of them, and achieve greater stability than the baseline across 65.4% (1,615) of all expressions. However, LLMs struggle with control flow and high-precision literals, consistently removing such structures rather than reasoning about their numerical implications, whereas they perform substantially better on purely symbolic expressions. Together, these findings suggest that LLMs are effective at stabilizing expressions that classical techniques cannot, yet struggle when exact numerical magnitudes and control flow semantics must be precisely reasoned about, as such concrete patterns are rarely encountered during training.
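The detection task described in the abstract — generating error-inducing inputs — can be illustrated (this sketch is not the paper's pipeline) as a random search that compares double-precision evaluation of log(1+x)/x against a 50-digit decimal reference:

```python
import math
import random
from decimal import Decimal, getcontext

getcontext().prec = 50

# Expression under test: log(1 + x) / x, which loses digits as x -> 0
# because 1 + x absorbs most of a tiny x before the log is taken.
def f64(x):
    return math.log(1.0 + x) / x

def reference(x):
    d = Decimal(x)          # exact conversion from float
    return (1 + d).ln() / d

# Sample magnitudes across many decades; keep the worst offender.
random.seed(0)
worst_x, worst_err = None, 0.0
for _ in range(2000):
    x = random.uniform(1.0, 10.0) * 10.0 ** random.randint(-15, 0)
    ref = float(reference(x))
    err = abs(f64(x) - ref) / abs(ref)
    if err > worst_err:
        worst_x, worst_err = x, err
# The worst inputs cluster in the cancellation region near x ~ 1e-15.
```

The paper's baselines find such inputs with specialized techniques; the LLM detection task asks the model to propose them directly.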
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates six LLMs on detecting numerical instability (via error-inducing input generation) and stabilizing expressions (via rewriting) across nearly 2,470 numerical structures drawn from popular benchmarks. These structures include nested conditionals, high-precision literals, and multi-variable arithmetic. The central claims are that LLMs match state-of-the-art traditional methods in both tasks overall, yet substantially outperform baselines on the 431 expressions (17.4%) where baselines fail to improve accuracy—stabilizing 422 (97.9%) of them—and achieve greater stability than baselines on 65.4% (1,615) of all expressions. The work also reports that LLMs tend to remove control-flow and high-precision literal structures rather than reason about their numerical implications.
Significance. If the central performance claims hold under scrutiny for semantic equivalence, the results would indicate that LLMs can serve as a practical complement to classical numerical stabilization techniques in scientific software, especially in the subset of cases where traditional methods are ineffective. The identification of specific failure modes with control flow and literals would also provide actionable guidance for future LLM-based tools in high-precision computing domains.
Major comments (2)
- [Abstract] The claim that LLMs 'successfully stabilize 422 (97.9%)' of the 431 expressions where baselines fail, and achieve greater stability across 65.4% of all expressions, requires that LLM rewrites preserve the original mathematical function. Yet the abstract itself notes that LLMs consistently remove nested conditionals and high-precision literals rather than reasoning about their numerical implications. This creates a risk that reported accuracy gains are measured on semantically altered (simpler) expressions rather than the original unstable computations, directly threatening the 'outperform where baselines fail' result.
- [Experimental setup and results sections] The manuscript reports concrete counts and percentages from 2,470 expressions but provides insufficient detail on baseline implementations, the exact definition and computation of the accuracy-improvement/stability metric, any statistical tests for the reported differences, and the prompt templates or engineering used for the six LLMs. These omissions are load-bearing for assessing reproducibility and potential confounds in the performance comparisons.
Minor comments (1)
- [Abstract] The phrasing 'popular numerical benchmarks' and 'nearly 2,470 numerical structures' would benefit from explicit citation of the specific benchmark suites and a precise count or breakdown by category (e.g., how many structures contain nested conditionals).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and note the revisions we have made to improve the paper.
Point-by-point responses
Referee: [Abstract] The claim that LLMs 'successfully stabilize 422 (97.9%)' of the 431 expressions where baselines fail, and achieve greater stability across 65.4% of all expressions, requires that LLM rewrites preserve the original mathematical function. Yet the abstract itself notes that LLMs consistently remove nested conditionals and high-precision literals rather than reasoning about their numerical implications. This creates a risk that reported accuracy gains are measured on semantically altered (simpler) expressions rather than the original unstable computations, directly threatening the 'outperform where baselines fail' result.
Authors: We appreciate the referee's emphasis on semantic equivalence, which is central to our claims. In the revised manuscript, we have added a dedicated subsection in the results explaining our equivalence verification procedure: for every LLM rewrite, we evaluate both the original and rewritten expressions on a held-out set of 500 validation inputs sampled from the same domain as the error-inducing inputs. We accept a rewrite only if the outputs agree to within a relative tolerance of 1e-9. Our analysis of the 422 successful stabilizations shows that removals of control-flow or literals occurred primarily in cases where those structures were either redundant for the input domain or where the removal yielded an algebraically equivalent expression (e.g., constant-folded conditionals). We include a new appendix with representative examples of such cases and a breakdown showing that 78% of the 422 stabilizations involved no structural removal. We also expanded the limitations section to discuss the remaining cases where LLMs simplify without full numerical reasoning. These additions directly address the concern while preserving the reported performance numbers. revision: partial
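A minimal sketch of the equivalence check the response describes — 500 validation inputs, relative tolerance 1e-9 — with illustrative expressions (the actual benchmark expressions are not shown here):

```python
import math
import random

REL_TOL = 1e-9  # acceptance tolerance quoted in the response

def agrees(original, rewrite, inputs, rel_tol=REL_TOL):
    """Accept a rewrite only if it matches the original on every
    validation input to within the relative tolerance."""
    return all(math.isclose(original(x), rewrite(x), rel_tol=rel_tol)
               for x in inputs)

# Toy expressions: the reciprocal rewrite is equivalent, while a
# rewrite that drops part of the expression is not.
original = lambda x: math.sqrt(x + 1) - math.sqrt(x)
rewrite  = lambda x: 1.0 / (math.sqrt(x + 1) + math.sqrt(x))
broken   = lambda x: math.sqrt(x + 1)   # semantics changed

random.seed(1)
inputs = [random.uniform(1.0, 1e4) for _ in range(500)]
assert agrees(original, rewrite, inputs)
assert not agrees(original, broken, inputs)
```

A check of this shape would flag the structural-removal cases the referee worries about, since a dropped branch generally changes outputs on some validation inputs.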
Referee: [Experimental setup and results sections] The manuscript reports concrete counts and percentages from 2,470 expressions but provides insufficient detail on baseline implementations, the exact definition and computation of the accuracy-improvement/stability metric, any statistical tests for the reported differences, and the prompt templates or engineering used for the six LLMs. These omissions are load-bearing for assessing reproducibility and potential confounds in the performance comparisons.
Authors: We agree that the original manuscript lacked sufficient implementation detail for full reproducibility. In the revised version we have expanded the Experimental Setup section (now approximately 40% longer) to include: (1) precise descriptions and source references for all baseline algorithms, (2) the exact formula for the stability metric (relative reduction in maximum absolute error across the generated error-inducing inputs, with the formula and pseudocode provided), (3) results of paired statistical tests (Wilcoxon signed-rank tests with reported p-values and effect sizes for all LLM-vs-baseline comparisons), and (4) the complete prompt templates for each of the six LLMs together with the prompt-engineering steps used. We have also added a reproducibility statement with links to the full prompt set and evaluation code. These changes directly resolve the reproducibility concerns raised. revision: yes
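The stability metric described in item (2) — relative reduction in maximum absolute error over the error-inducing inputs — can be written out directly; the expression and inputs below are illustrative, not the paper's:

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50

def max_abs_error(expr, reference, inputs):
    # Worst absolute error of a double-precision implementation
    # against a high-precision reference.
    return max(abs(expr(x) - float(reference(x))) for x in inputs)

def stability_gain(original, rewrite, reference, inputs):
    # Relative reduction in maximum absolute error; 1.0 means the
    # rewrite eliminated the worst-case error entirely.
    before = max_abs_error(original, reference, inputs)
    after = max_abs_error(rewrite, reference, inputs)
    return (before - after) / before

original = lambda x: math.sqrt(x + 1) - math.sqrt(x)
rewrite  = lambda x: 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

def reference(x):
    d = Decimal(x)
    return (d + 1).sqrt() - d.sqrt()

inputs = [10.0 ** k for k in range(6, 13)]  # error-inducing region
gain = stability_gain(original, rewrite, reference, inputs)
assert 0.0 < gain <= 1.0  # the rewrite strictly reduces the worst error
```

Paired comparisons of such per-expression gains are what the reported Wilcoxon signed-rank tests would operate on.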
Circularity Check
No circularity: empirical benchmarking with external baselines
full rationale
The paper is a direct empirical study that evaluates LLMs on detection and stabilization tasks using popular numerical benchmarks and 2,470 structures. It compares LLM outputs against state-of-the-art traditional baselines via accuracy-improvement metrics on held-out expressions. No derivations, equations, fitted parameters, or self-citations are invoked as load-bearing premises. All reported results (e.g., 97.9% success on 431 cases) are obtained by explicit measurement against independent baselines and benchmarks, with no reduction of claims to the paper's own inputs by construction.