Artificial Intelligence in Number Theory: LLMs for Algorithm Generation and Ensemble Methods for Conjecture Verification

Ali Saraeb

arxiv: 2504.19451 · v3 · submitted 2025-04-28 · 🧮 math.NT · cs.AI

Pith reviewed 2026-05-22 19:21 UTC · model grok-4.3

keywords: large language models, Dirichlet L-functions, conductor prediction, algorithm generation, conjecture verification, machine learning, zeros of L-functions, number theory

The pith

An open-source large language model solves algorithmic number theory tasks at 95 percent accuracy or higher when given optimal non-spoiling hints, while a gradient-boosting classifier predicts the conductor of Dirichlet L-functions from statistical features of their initial zeros.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates two practical uses of artificial intelligence in number theory. It first evaluates an open-source large language model on sixty tasks drawn from textbooks and online forums, consisting of thirty algorithmic problems and thirty computational questions. With hints that guide without revealing the answer, the model reaches at least 0.95 accuracy relative to the correct output on every task. The second part trains a LightGBM multiclass classifier to predict the conductor q from moments, finite-difference statistics, and FFT magnitudes computed on the initial nontrivial zeros of the corresponding L-function. On a test set drawn from 214 randomly selected Dirichlet L-functions, the classifier reaches at least 93.9 percent accuracy for small q, supplying empirical evidence that the zeros uniquely determine the modulus.
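The source does not define how accuracy "relative to the correct output" is computed. One plausible reading, sketched here as an assumption rather than the paper's protocol, scores a generated algorithm by the fraction of test inputs on which it agrees with a trusted reference. The names `agreement_rate` and `candidate_gcd` are hypothetical, and the greatest-common-divisor task is only an illustrative stand-in for a benchmark problem.

```python
import math
import random


def agreement_rate(candidate, reference, inputs):
    """Fraction of inputs on which `candidate` returns the reference output."""
    hits = 0
    for args in inputs:
        try:
            ok = candidate(*args) == reference(*args)
        except Exception:
            ok = False  # a crash counts as a wrong answer
        hits += ok
    return hits / len(inputs)


def candidate_gcd(a, b):
    """Stand-in for model-generated code: the Euclidean algorithm."""
    while b:
        a, b = b, a % b
    return a


# Random positive test inputs; math.gcd plays the role of the trusted reference.
rng = random.Random(0)
cases = [(rng.randrange(1, 10**6), rng.randrange(1, 10**6)) for _ in range(200)]
score = agreement_rate(candidate_gcd, math.gcd, cases)
print(f"agreement with reference: {score:.2f}")
```

Under this reading, a task would clear the reported bar when `score` is at least 0.95.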

Core claim

The initial nontrivial zeros of Dirichlet L-functions encode sufficient statistical information to identify the conductor q of the underlying character, as shown by training a LightGBM classifier on moments, finite-difference statistics, and FFT magnitudes extracted from those zeros and obtaining at least 93.9 percent test accuracy. In a separate application, the Qwen2.5-Math-7B-Instruct model produces correct algorithmic and computational outputs in number theory at 0.95 accuracy or better when supplied with an optimal non-spoiling hint.

What carries the argument

A LightGBM multiclass classifier that maps a vector of statistical features (moments, finite-difference statistics, FFT magnitudes) of the first nontrivial zeros to the conductor q of the Dirichlet L-function.
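As a rough illustration of what such a feature vector might look like, the sketch below computes the three feature families named above from the imaginary parts of the first k zeros. The function name `zero_features`, the choice of four moments, the gap statistics, and the number of Fourier coefficients are all assumptions, not the paper's actual feature set. The discrete Fourier transform is written out naively so that only the standard library is needed.

```python
import cmath
import math
import statistics


def zero_features(ordinates, n_fft=4):
    """Map the imaginary parts of the first zeros to a flat feature vector.

    `ordinates` is an increasing list of floats (hypothetical input format).
    """
    k = len(ordinates)
    if k < 3:
        raise ValueError("need at least three zeros")

    # Moments: mean, variance, and standardized third and fourth moments.
    mean = statistics.fmean(ordinates)
    var = statistics.pvariance(ordinates, mu=mean)
    std = math.sqrt(var)
    if std == 0:
        raise ValueError("zeros must not all coincide")
    skew = sum((g - mean) ** 3 for g in ordinates) / (k * std ** 3)
    kurt = sum((g - mean) ** 4 for g in ordinates) / (k * std ** 4)

    # Finite-difference statistics: gaps between consecutive zeros.
    gaps = [b - a for a, b in zip(ordinates, ordinates[1:])]
    gap_mean = statistics.fmean(gaps)
    gap_std = statistics.pstdev(gaps)

    # Magnitudes of the first few discrete Fourier coefficients (naive DFT).
    fft_mags = []
    for n in range(1, n_fft + 1):
        coeff = sum(g * cmath.exp(-2j * math.pi * n * j / k)
                    for j, g in enumerate(ordinates))
        fft_mags.append(abs(coeff))

    return [mean, var, skew, kurt,
            gap_mean, gap_std, min(gaps), max(gaps)] + fft_mags
```

Each L-function would contribute one such vector, labeled with its conductor q, as a row of the training table.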

If this is right

Algorithmic and computational problems in number theory can be solved to high accuracy by current large language models when modest non-revealing guidance is supplied.
The folklore conjecture that the initial zeros uniquely determine the conductor gains empirical support for small q, since the classifier recovers q from zero-derived statistics alone with high accuracy.
Statistical summaries of low-lying zeros can serve as a practical proxy for recovering arithmetic invariants such as the conductor without explicit knowledge of the character.
Ensemble methods supply a concrete, reproducible route for empirically checking conjectures in analytic number theory that link zeros to arithmetic data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feature-based classification approach could be tested on conductors larger than those appearing in the current training set to probe how far the uniqueness extends.
Interactive hint systems might be combined with zero-based classifiers to create hybrid tools that both generate candidate algorithms and check consistency with known L-function data.
If the zero-to-conductor mapping proves robust, similar pipelines could be applied to other arithmetic objects whose L-functions are expected to carry identifying information in their low zeros.

Load-bearing premise

The chosen statistical features extracted from the initial zeros are sufficient to identify the conductor uniquely, and the classifier is not overfitting to the 214 sampled L-functions.

What would settle it

Apply the trained classifier to the initial zeros of Dirichlet L-functions whose conductors lie outside the range seen in training and check whether accuracy remains above 90 percent or falls sharply.

Original abstract

This paper presents two concrete applications of Artificial Intelligence to algorithmic and analytic number theory. Recent benchmarks of large language models have mainly focused on general mathematics problems and the currently infeasible objective of automated theorem proving. In the first part of this paper, we relax our ambition and focus on a more specialized domain: we evaluate the performance of the state-of-the-art open-source large language model Qwen2.5-Math-7B-Instruct on algorithmic and computational tasks in algorithmic number theory. On a benchmark of thirty algorithmic problems and thirty computational questions taken from classical number-theoretic textbooks and Math StackExchange, the model achieves at least 0.95 accuracy (relative to the true answer) on every problem or question when given an optimal non-spoiling hint. The second part of the paper empirically verifies a folklore conjecture in analytic number theory stating that the modulus \(q\) of a Dirichlet character \(\chi\) is uniquely determined by the initial nontrivial zeros \(\{\rho_1,\dots,\rho_k\}\) (for some \(k\in\mathbb{N}\)) of the corresponding Dirichlet \(L\)-function \(L(s,\chi)\). We train a LightGBM multiclass classifier to predict the conductor \(q\) for 214 randomly chosen Dirichlet \(L\)-functions from a vector of statistical features of their initial zeros (moments, finite-difference statistics, FFT magnitudes, etc.). The model empirically verifies the conjecture for small \(q\), achieving at least 93.9\% test accuracy when sufficient statistical properties of the zeros are incorporated. For the second part of the paper, code and dataset are available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper reports solid benchmark numbers for an LLM on textbook number theory tasks and at least 93.9% test accuracy for a classifier that predicts conductors from zero statistics, but both parts need clearer protocols to count as strong evidence.

The paper shows two practical uses of AI in number theory. First, Qwen2.5-Math-7B-Instruct reaches at least 0.95 accuracy on each of 60 algorithmic and computational problems drawn from textbooks and Math StackExchange once an optimal non-spoiling hint is supplied. Second, a LightGBM classifier trained on moments, finite differences, and FFT magnitudes from the first nontrivial zeros of 214 Dirichlet L-functions predicts the conductor q with at least 93.9% test accuracy for small q. Code and data are released for the second part, which aids independent verification.

Referee Report

2 major / 2 minor

Summary. The manuscript presents two applications of AI to number theory. In the first part, it evaluates the Qwen2.5-Math-7B-Instruct LLM on a benchmark of 30 algorithmic problems and 30 computational questions drawn from classical textbooks and Math StackExchange, claiming at least 0.95 accuracy relative to the true answer when each task is supplied with an optimal non-spoiling hint. In the second part, it trains a LightGBM multiclass classifier on a vector of statistical features (moments, finite-difference statistics, FFT magnitudes, etc.) extracted from the initial nontrivial zeros of 214 randomly chosen Dirichlet L-functions to predict the conductor q, reporting at least 93.9% test accuracy and thereby empirically supporting the folklore conjecture that q is uniquely determined by these zeros for small q. Code and dataset are made available for the second part.

Significance. If the methodological details are clarified, the LLM benchmark illustrates a practical, domain-specific use of current models for algorithm generation in number theory, though the dependence on hints restricts broader claims. The classifier experiment supplies concrete empirical support for a long-standing conjecture in analytic number theory; confirmation that the feature set was fixed in advance would make this a useful data point for future theoretical investigations into the distribution of zeros. The public release of code and data for the ML component is a clear strength that facilitates independent checks.

major comments (2)

[LLM benchmark description] In the section describing the LLM benchmark: the selection process and criteria for the 'optimal non-spoiling hint' supplied to each of the 60 tasks are not specified. Because the reported 0.95 accuracy depends on these hints, the absence of a reproducible protocol undermines the central performance claim.
[LightGBM multiclass classifier] In the section on the LightGBM multiclass classifier: the paper does not state whether the feature set (moments, finite-difference statistics, FFT magnitudes, etc.) was fixed before any model training or was chosen or refined after inspecting performance on the full collection of 214 L-functions. It also omits the train/test split ratio, cross-validation procedure, and uncertainty estimates around the 93.9% test accuracy. These omissions are load-bearing for the claim that the accuracy constitutes independent empirical verification of the conjecture rather than a possible consequence of post-hoc tuning on the given examples.
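One way to attach the uncertainty estimate requested in the second comment is a Wilson score interval on the test accuracy. The sketch below is a generic binomial interval, not something the paper reports, and the counts in the usage line are placeholders because the test-set size is not given.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total
                         + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Placeholder counts only: the source does not state the test-set size.
low, high = wilson_interval(correct=31, total=33)
print(f"accuracy {31 / 33:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

With a test set of only a few dozen examples the interval is wide, which is why the split ratio and test-set size matter for interpreting the headline figure.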

minor comments (2)

[Title and abstract] The title mentions 'ensemble methods' for conjecture verification, yet the body describes a single LightGBM classifier; a brief clarification of whether any ensembling step is employed would improve consistency.
[Conjecture verification] Consider reporting the precise number k of initial zeros used to compute the feature vectors, as this parameter directly affects the strength of the uniqueness claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We are pleased that the referee recognizes the potential significance of both the LLM benchmark and the empirical support for the Dirichlet conductor conjecture. Below, we provide point-by-point responses to the major comments and outline the revisions we intend to make.

Referee: In the section describing the LLM benchmark: the selection process and criteria for the 'optimal non-spoiling hint' supplied to each of the 60 tasks are not specified. Because the reported 0.95 accuracy depends on these hints, the absence of a reproducible protocol undermines the central performance claim.

Authors: We agree with the referee that the lack of a specified protocol for hint selection limits the reproducibility of the benchmark results. The hints were selected by the authors to be minimal interventions that provide necessary background or point to an appropriate method without revealing the answer or critical steps. To address this, we will revise the manuscript to include a clear description of the hint generation criteria and include illustrative examples for several problems from the benchmark. This will allow future researchers to understand and potentially apply similar hinting strategies.
revision: yes

Referee: In the section on the LightGBM multiclass classifier: the paper does not state whether the feature set (moments, finite-difference statistics, FFT magnitudes, etc.) was fixed before any model training or was chosen or refined after inspecting performance on the full collection of 214 L-functions. It also omits the train/test split ratio, cross-validation procedure, and uncertainty estimates around the 93.9% test accuracy. These omissions are load-bearing for the claim that the accuracy constitutes independent empirical verification of the conjecture rather than a possible consequence of post-hoc tuning on the given examples.

Authors: We confirm that the feature set was predetermined based on common statistical features for analyzing sequences of complex numbers, prior to any training or performance evaluation. However, we recognize the importance of explicitly documenting the experimental setup. In the revised version, we will add details on the a priori feature selection, the train/test split ratio used, the cross-validation method employed, and uncertainty measures such as confidence intervals or standard deviations obtained from multiple runs. The public code repository will be updated with comments reflecting these choices to facilitate verification.
revision: yes
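The repeated-run reporting promised in this response could look roughly like the following. This is a minimal sketch under assumptions: `stratified_split`, `repeated_accuracy`, and the 20 percent test fraction are hypothetical, and `nearest_mean` is a toy stand-in classifier used only to keep the example dependency-free. In practice a wrapper around the LightGBM model would be passed as `fit_predict`.

```python
import random
import statistics
from collections import defaultdict


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), holding out part of each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Keep at least one example of every class on the training side.
        n_test = min(len(idx) - 1, max(1, round(test_fraction * len(idx))))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test


def repeated_accuracy(X, y, fit_predict, runs=10, test_fraction=0.2, seed=0):
    """Mean and standard deviation of test accuracy over repeated splits.

    `fit_predict(X_train, y_train, X_test)` is a caller-supplied callable.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train, test = stratified_split(y, test_fraction, rng)
        if not test:
            raise ValueError("every class needs at least two examples")
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        hits = sum(p == y[i] for p, i in zip(preds, test))
        scores.append(hits / len(test))
    return statistics.fmean(scores), statistics.pstdev(scores)


def nearest_mean(X_train, y_train, X_test):
    """Toy stand-in (not LightGBM): nearest class mean on the first feature."""
    groups = defaultdict(list)
    for x, label in zip(X_train, y_train):
        groups[label].append(x[0])
    means = {label: statistics.fmean(v) for label, v in groups.items()}
    return [min(means, key=lambda c: abs(means[c] - x[0])) for x in X_test]
```

Reporting the mean together with the spread across seeds makes it visible whether a single favorable split is driving the headline accuracy.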

Circularity Check

0 steps flagged

No significant circularity; classifier accuracy provides independent evidence

The paper's second part trains LightGBM on independently computed statistical features (moments, finite-difference statistics, FFT magnitudes) extracted from the initial zeros of 214 L-functions to predict conductor q, reporting 93.9% test accuracy. This does not reduce by construction to a fitted parameter encoding q, as the features derive solely from the zeros and the test split evaluates generalization. The first part is an empirical LLM benchmark on algorithmic tasks with non-spoiling hints, not a derivation. No self-citations, self-definitional steps, or ansatzes are load-bearing. The chain is self-contained against the external benchmark of held-out classification performance.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard assumptions about LLM prompting and ML generalization plus the domain assumption that zero statistics suffice to determine conductors.

free parameters (1)

LightGBM hyperparameters and feature set
Tuned to achieve the reported 93.9% test accuracy on the 214 examples.

axioms (1)

domain assumption: Statistical summaries of the first nontrivial zeros contain enough information to distinguish different conductors q for small values.
This is the folklore conjecture under empirical test in the second part.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We train a LightGBM multiclass classifier to predict the conductor q for 214 randomly chosen Dirichlet L-functions from a vector of statistical features of their initial zeros (moments, finite-difference statistics, FFT magnitudes, etc.)."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat ≃ Nat · unclear
Paper passage: "the model achieves at least 0.95 accuracy ... on every problem or question when given an optimal non-spoiling hint"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Ali-Saraeb1/AI-Algorithmic-Number-Theory: Artificial Intelligence in Number Theory: LLMs for Algorithm Generation (v2.3.4)

Ali Saraeb. Ali-Saraeb1/AI-Algorithmic-Number-Theory: Artificial Intelligence in Number Theory: LLMs for Algorithm Generation (v2.3.4). Zenodo, 2025. https://doi.org/10.5281/zenodo.15293187

work page doi:10.5281/zenodo.15293187 2025

[2] [2]

librosa/librosa: 0.6.3,

Ali Saraeb. Ali-Saraeb1/AI-Analytic-Number-Theory: Artificial Intelligence in Number Theory: Ensem- ble Methods for Conjecture Verification (v1.0.2). Zenodo, 2025. https://doi.org/10.5281/zenodo. 15460772

work page doi:10.5281/zenodo 2025

[3] [3]

Allal, A

J. Allal, A. Li, C. Zheng, and Z. Yang. Redefining the benchmark to evaluate code-generating LLMs. In Findings of EMNLP 2024, 2024

work page 2024

[4] [4]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020

work page 1901

[5] [6]

M. Chen, J. Tworek, H. Jun, Q. Yuan, B. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, and I. Sutskever. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [7]

The Llama 3 Herd of Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi`ere, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. The LLaMA 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [8]

Lewkowycz, A

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra. Solving quantitative reasoning problems with language models (Minerva). In ICLR, 2022

work page 2022

[8] [9]

Hendrycks, C

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In NeurIPS 2021 Datasets and Benchmarks Track, 2021

work page 2021

[9] [10]

Kushman, Y

N. Kushman, Y. Artzi, L. Zettlemoyer, and R. Barzilay. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 271–281, 2014

work page 2014

[10] [11]

Huang, S

D. Huang, S. Shi, C.-Y. Lin, J. Yin, and W.-Y. Ma. How well do computers solve math word problems? Large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 887–896, 2016

work page 2016

[11] [12]

H. Liu, Z. Zheng, Y. Qiao, H. Duan, Z. Fei, F. Zhou, W. Zhang, S. Zhang, D. Lin, and K. Chen. MathBench: Evaluating the theory and application proficiency of LLMs with a hierarchical mathematics benchmark. arXiv:2405.12209 [cs.CL], 2024

work page arXiv 2024

[12] [13]

Y. Mao, Y. Kim, and Y. Zhou. CHAMP: A competition-level dataset for fine-grained analyses of LLMs’ mathematical reasoning capabilities. In Findings of the ACL 2024 Conference, pages 1–12, 2024

work page 2024

[13] [14]

J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, and W. Yin. Large language models for mathematical reasoning: Progresses and challenges. In Proceedings of the Student Research Workshop at EACL 2024, pages 225–237, 2024

work page 2024

[14] [15]

Satpute, N

A. Satpute, N. Giessing, A. Greiner-Petter, M. Schubotz, O. Teschke, A. Aizawa, and B. Gipp. Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24), Washington, DC, USA, 2024. 28

work page 2024

[15] [16]

Agrawal, P

V. Agrawal, P. Singla, A. S. Miglani, S. Garg, and A. Mangal. Give me a hint: Can LLMs take a hint to solve math problems? arXiv preprint arXiv:2410.05915, 2024

work page arXiv 2024

[16] [17]

J. Fu, S. Huangfu, H. Yan, S.-K. Ng, and X. Qiu. Hint-before-solving prompting: Guiding LLMs to effectively utilize encoded knowledge. arXiv preprint arXiv:2402.14310, 2024

work page arXiv 2024

[17] [18]

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

A. Amini, S. Gabriel, P. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. arXiv:1905.13319 [cs.CL], 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[18] [19]

Patel, S

A. Patel, S. Bhattamishra, and N. Goyal. Are NLP models really able to solve simple math word problems? In NAACL ’21: Proceedings of the 2021 Conference of the North American Chapter of the ACL, pages 2080–2094, Online, 2021

work page 2021

[19] [20]

O. Shanker. Neural network prediction of Riemann zeta zeros. Advanced Modeling and Optimization , 14(3):717–728, 2012

work page 2012

[20] [21]

Alessandretti, A

L. Alessandretti, A. Baronchelli, and Y. H. He. ML meets Number Theory: The Data Science of Birch–Swinnerton–Dyer. arXiv:1911.02008 [math.NT], 2019

work page arXiv 1911

[21] [22]

He, K.-H

Y.-H. He, K.-H. Lee, and T. Oliver. Machine-learning the Sato–Tate conjecture. Journal of Symbolic Computation, 111:61–72, 2022

work page 2022

[22] [23]

Bach and J

E. Bach and J. O. Shallit. Algorithmic Number Theory: Efficient Algorithms, Vol. 1. MIT Press, Cambridge, MA, 1996

work page 1996

[23] [24]

J. P. Buhler and P. Stevenhagen (Eds). Algorithmic Number Theory: Lattices, Number Fields, Curves and Cryptography. MSRI Publications, Vol. 44. Cambridge University Press, Cambridge, 2008

[24] [25]

H. Cohen, A Course in Computational Algebraic Number Theory, Springer, 1993

[25] [26]

R. Crandall and C. Pomerance, Prime Numbers: A Computational Perspective, Springer, 2001

[26] [27]

V. Shoup, A Computational Introduction to Number Theory and Algebra, Cambridge University Press, 2009

[27] [28]

T. M. Apostol, Introduction to Analytic Number Theory, Undergraduate Texts in Mathematics, Springer-Verlag, New York, 1976

[28] [29]

H. Davenport, Multiplicative Number Theory, 3rd ed., Graduate Texts in Mathematics, vol. 74, Springer-Verlag, New York, 2000

[29] [30]

E. C. Titchmarsh, The Theory of the Riemann Zeta-Function, 2nd ed., revised by D. R. Heath-Brown, Oxford University Press, Oxford, 1986

[30] [31]

The LMFDB Collaboration, The L-functions and Modular Forms Database, https://www.lmfdb.org, accessed April 2025

[31] [32]

A. Yang, B. Zhang, B. Hui, B. Gao, B. Yu, C. Li, D. Liu, J. Tu, J. Zhou, J. Lin, K. Lu, M. Xue, R. Lin, T. Liu, X. Ren, and Z. Zhang, Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement, arXiv preprint arXiv:2409.12122, 2024

[32] [33]

R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023

[33] [34]

OpenAI, GPT-4 Technical Report, CoRR, vol. abs/2303.08774, 2023

[34] [35]

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986

[35] [36]

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001

[36] [37]

T. G. Dietterich. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems (MCS), volume 1857 of Lecture Notes in Computer Science, pages 1–15. Springer, 2000

[37] [38]

G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems, volume 30, pages 3146–3154, 2017

[38] [39]

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, et al. MiniF2F: A Benchmark of Formalized Competition Mathematics. In ICLR, 2022

[39] [40]

W. Ling, D. Yogatama, C. Dyer, and P. Blunsom. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems. In ACL, pages 158–167, 2017

[40] [41]

W. Chen, M. Yin, M. Ku, P. Lu, and Y. Wan. MathEval: A Comprehensive Benchmark for Mathematical Reasoning. In EMNLP Findings, 2023

[41] [42]

P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha, A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications, arXiv preprint arXiv:2402.07927, 2024

[42] [43]

S. Vatsal and H. Dubey, A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks, arXiv preprint arXiv:2407.12994, 2024

[43] [44]

Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, M. Huang, N. Duan, and W. Chen. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving. In Proceedings of the International Conference on Learning Representations (ICLR), 2024. URL: https://openreview.net/forum?id=Ep0TtjVoap

[44] [45]

M. T. Ribeiro, S. Singh, and C. Guestrin, Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, In Proceedings of ACL 2020, pp. 4902–4912, 2020. doi:10.18653/v1/2020.acl-main.442
