arxiv: 2605.19723 · v1 · pith:X2F6UCDInew · submitted 2026-05-19 · 💻 cs.CL · cs.AI

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

Pith reviewed 2026-05-20 05:14 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · mathematical reasoning · benchmarks · evaluation metrics · reasoning architectures · survey · failure modes

The pith

A review of about 120 studies maps the progress and persistent gaps in large language models for mathematical reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a systematic survey of roughly 120 studies on how large language models handle mathematical reasoning. It organizes datasets into a taxonomy based on their roles in pretraining, fine-tuning, and evaluation at different levels of complexity. The authors analyze various architectures and strategies such as tool integration and verifier-guided approaches, then compare evaluation metrics to show the difference between final-answer accuracy and actual process verification. They identify repeated problems including unfaithful reasoning, biased benchmarks, and weak generalization to new problems. The work ends by pointing to needed advances in symbolic grounding and more reliable evaluation methods for trustworthy systems.

Core claim

Through its unified taxonomy of datasets and analysis of architectures and metrics, the paper establishes that current large language models show gains in final-answer accuracy on mathematical tasks yet frequently fail at faithful step-by-step reasoning, suffer from benchmark biases, and generalize poorly, requiring targeted improvements in symbolic integration and process-level verification.

What carries the argument

The unified analytical framework that classifies mathematical datasets by usage stage and reasoning complexity while comparing training strategies such as tool integration and verifier guidance.
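The two classification axes described here (usage stage and reasoning complexity) can be pictured as a small data structure. The following is a minimal sketch only: the complexity levels and the dataset names are hypothetical placeholders, since this summary does not give the survey's actual scale or its dataset assignments.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    # The three usage stages named in the abstract.
    PRETRAINING = "pretraining corpus"
    FINE_TUNING = "supervised fine-tuning resource"
    EVALUATION = "evaluation benchmark"


class Complexity(Enum):
    # Hypothetical levels; the survey's own scale is not stated in this summary.
    ARITHMETIC = 1
    MULTI_STEP = 2
    COMPETITION = 3


@dataclass(frozen=True)
class Dataset:
    name: str
    stage: Stage
    complexity: Complexity


def group_by_axes(datasets):
    """Index dataset names by their (stage, complexity) cell."""
    grid = defaultdict(list)
    for d in datasets:
        grid[(d.stage, d.complexity)].append(d.name)
    return dict(grid)


def uncovered_cells(grid):
    """List the (stage, complexity) cells that no dataset occupies."""
    return [(s, c) for s in Stage for c in Complexity if (s, c) not in grid]


# Placeholder dataset names, not the paper's classification.
EXAMPLES = [
    Dataset("web-math-corpus", Stage.PRETRAINING, Complexity.MULTI_STEP),
    Dataset("worked-solutions-set", Stage.FINE_TUNING, Complexity.MULTI_STEP),
    Dataset("word-problem-benchmark", Stage.EVALUATION, Complexity.ARITHMETIC),
    Dataset("olympiad-benchmark", Stage.EVALUATION, Complexity.COMPETITION),
]

grid = group_by_axes(EXAMPLES)
gaps = uncovered_cells(grid)
```

Laying the taxonomy out as a grid makes empty cells explicit, which is one way a reader could check where coverage is thin.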

If this is right

  • Metrics focused on process verification rather than final answers would give a more accurate picture of model capability.
  • Architectures that incorporate tools or verifiers improve robustness compared with standard fine-tuning alone.
  • Benchmark biases must be reduced before performance claims can be trusted across different problem distributions.
  • Greater emphasis on symbolic grounding would help close the gap between surface accuracy and reliable reasoning.
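The gap between final-answer accuracy and process-level verification can be illustrated with a toy scorer. This is a sketch under stated assumptions: each reasoning step is represented as a hypothetical `(a, op, b, claimed)` tuple, the two traces are invented for the flower word problem quoted in Figure 1's caption (they are not the trace shown in the paper), and the gold answer of 5 is derived here from the problem text rather than taken from the paper. The checker only recomputes arithmetic; it does not judge whether a step is logically relevant.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def step_is_valid(step):
    """Recompute one arithmetic step (a, op, b, claimed) and compare."""
    a, op, b, claimed = step
    return abs(OPS[op](a, b) - claimed) < 1e-9


def final_answer_correct(trace, gold):
    """Final-answer metric: only the last claimed value is checked."""
    return abs(trace[-1][3] - gold) < 1e-9


def process_correct(trace, gold):
    """Process-level metric: every step must verify and the answer must match."""
    return all(step_is_valid(s) for s in trace) and final_answer_correct(trace, gold)


def accuracy(traces, gold, metric):
    """Fraction of traces that the given metric accepts."""
    return sum(metric(t, gold) for t in traces) / len(traces)


# Hypothetical traces. Gold answer assumed: 5 flowers per daughter.
faithful = [
    (5, "*", 4, 20),    # flowers harvested
    (20, "-", 10, 10),  # net growth: 20 grown minus 10 died
    (20, "-", 10, 10),  # flowers planted in total
    (10, "/", 2, 5),    # flowers per daughter
]
unfaithful = [
    (5, "*", 4, 20),
    (20, "-", 20, 10),  # wrong arithmetic that still lands on 10
    (10, "/", 2, 5),
]

traces = [faithful, unfaithful]
final_acc = accuracy(traces, 5, final_answer_correct)
process_acc = accuracy(traces, 5, process_correct)
```

Both traces end on the same number, so the final-answer metric cannot tell them apart, while the process-level metric rejects the trace containing the invalid step.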

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same faithfulness and generalization issues likely appear in non-mathematical reasoning domains such as logical inference or scientific hypothesis generation.
  • The taxonomy could serve as a template for creating new evaluation sets that deliberately test for reasoning faithfulness across varying difficulty levels.
  • Developers might prioritize hybrid systems that combine language models with external symbolic solvers to address the limitations identified here.
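One way to picture such a hybrid system is to let the language model propose an expression and hand the arithmetic to an external evaluator. The sketch below assumes a hypothetical model output of the form `(expression, claimed_answer)` and stands in for a symbolic solver with a small exact evaluator built on Python's standard `ast` and `fractions` modules; it is an illustration, not a method described in the paper.

```python
import ast
import operator
from fractions import Fraction

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate(expr):
    """Exactly evaluate a +, -, *, / expression over integers, without eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return Fraction(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        # Anything else (names, calls, floats) is rejected outright.
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))


def answer_with_tool(model_output):
    """Return the tool's value and whether the model's claimed answer agrees."""
    expr, claimed = model_output  # hypothetical output format
    value = evaluate(expr)
    return value, value == Fraction(claimed)


# Hypothetical model outputs: same expression, different claimed answers.
agrees = answer_with_tool(("(5 * 4 - 20 + 10) / 2", 5))
disagrees = answer_with_tool(("(5 * 4 - 20 + 10) / 2", 10))
```

Using exact rationals avoids floating-point noise when comparing the tool's result with the model's claim, and whitelisting node types keeps the evaluator from executing arbitrary code.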

Load-bearing premise

The selection of roughly 120 studies captures the main patterns in the field without major omissions or selection bias that would hide contradictory results.

What would settle it

A controlled study showing large language models that produce correct mathematical answers through fully traceable and faithful reasoning steps on a wide range of previously unseen problem types would contradict the reported recurring failure modes.

Figures

Figures reproduced from arXiv: 2605.19723 by Aamir Shahzad, Husnain Amjad, Mehwish Fatima, Raja Khurram Shahzad.

Figure 1: Top: Math word problem. Bottom: Step-by-step erroneous solution. Input Question: Dane’s two daughters need to plant a certain number of flowers each to grow a garden. As the days passed, the flowers grew into 20 more but 10 of them died. Dane’s daughters harvested the flowers and split them between 5 different baskets, with 4 flowers in each basket. How many flowers did each daughter plant initially? Answe…
Figure 2: Conceptual landscape of research on mathematical reasoning in large language…
Figure 3: PRISMA flow diagram of the systematic literature review selection process
Figure 4: Challenge pipeline summarizing the interconnected limitations affecting math…
Original abstract

Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper is a survey synthesizing advancements in mathematical reasoning for LLMs. It reviews approximately 120 studies on datasets, architectures, training strategies (including tool integration and verifier-guided reasoning), and evaluation protocols; introduces a unified taxonomy distinguishing pretraining corpora, supervised fine-tuning resources, and benchmarks by reasoning complexity; compares metrics to highlight gaps between final-answer accuracy and process-level verification; identifies recurring failure modes such as reasoning faithfulness issues, benchmark biases, and generalization limitations; and outlines future directions for symbolic grounding and trustworthy systems.

Significance. If the corpus selection proves representative and the taxonomy robust, the work supplies a consolidated analytical framework that organizes disparate findings, clarifies progress versus limitations, and could serve as a reference for researchers working on LLM reasoning benchmarks and architectures.

major comments (1)
  1. [Abstract and Systematic Review section] The central claim of reliably identifying recurring failure modes (reasoning faithfulness, benchmark biases) rests on the systematic review of ~120 studies, yet the manuscript provides no search strings, inclusion/exclusion criteria, date ranges, or explicit protocol for handling contradictory papers (see Abstract and the section describing the review process). This omission makes it impossible to assess selection bias or confirm that the synthesized patterns reflect the literature distribution rather than curation choices.
minor comments (1)
  1. [Taxonomy section] The unified taxonomy of mathematical datasets would be clearer with explicit examples or a table contrasting pretraining corpora, SFT resources, and evaluation benchmarks at different complexity levels.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback on our survey manuscript. We have carefully reviewed the major comment and provide a point-by-point response below. We agree that greater methodological transparency is warranted and will revise accordingly.

Point-by-point responses
  1. Referee: [Abstract and Systematic Review section] The central claim of reliably identifying recurring failure modes (reasoning faithfulness, benchmark biases) rests on the systematic review of ~120 studies, yet the manuscript provides no search strings, inclusion/exclusion criteria, date ranges, or explicit protocol for handling contradictory papers (see Abstract and the section describing the review process). This omission makes it impossible to assess selection bias or confirm that the synthesized patterns reflect the literature distribution rather than curation choices.

    Authors: We acknowledge that this observation is correct and that the manuscript would benefit from explicit documentation of the review process. While the abstract and relevant section describe the scope as encompassing approximately 120 studies, they do not detail the search strategy, criteria, or handling of conflicting results. In the revised manuscript we will add a new subsection titled 'Review Methodology' immediately following the introduction of the taxonomy. This subsection will specify: search databases (arXiv, ACL Anthology, NeurIPS/ICLR proceedings, Google Scholar), keywords and Boolean search strings (e.g., (LLM OR 'large language model') AND ('mathematical reasoning' OR 'math word problems' OR 'chain-of-thought')), date range (primarily January 2020 through submission date), inclusion criteria (peer-reviewed or high-quality preprints with empirical LLM evaluations on mathematical tasks), exclusion criteria (non-English works, purely theoretical papers without experiments, duplicate reports), and our approach to contradictory findings (prioritizing recent rigorous evaluations while explicitly noting and discussing divergent results in the failure-modes section). These additions will allow readers to better evaluate potential curation effects without changing the core synthesis or taxonomy. revision: yes
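The screening protocol proposed in this simulated response can be sketched as a simple filter. This is a minimal illustration, assuming a hypothetical `Record` type and substring matching for the Boolean query; real database search syntax differs, and the sample records are invented.

```python
from dataclasses import dataclass
from datetime import date

# Query terms quoted from the response: (model term) AND (task term).
MODEL_TERMS = ("llm", "large language model")
TASK_TERMS = ("mathematical reasoning", "math word problems", "chain-of-thought")


@dataclass(frozen=True)
class Record:  # hypothetical record shape
    title: str
    abstract: str
    published: date
    language: str
    has_experiments: bool


def matches_query(rec):
    """Crude substring version of the Boolean search string."""
    text = f"{rec.title} {rec.abstract}".lower()
    return any(t in text for t in MODEL_TERMS) and any(t in text for t in TASK_TERMS)


def screen(records, start=date(2020, 1, 1)):
    """Apply the exclusion rules in order and return the surviving records."""
    kept, seen_titles = [], set()
    for rec in records:
        key = rec.title.strip().lower()
        if key in seen_titles:
            continue  # duplicate report
        seen_titles.add(key)
        if rec.language != "en" or not rec.has_experiments:
            continue  # non-English or purely theoretical
        if rec.published < start or not matches_query(rec):
            continue  # outside the date range or off-topic
        kept.append(rec)
    return kept


# Invented sample records for illustration only.
SAMPLE = [
    Record("Verifier-guided LLM math",
           "We evaluate a large language model on math word problems.",
           date(2023, 5, 1), "en", True),
    Record("Verifier-guided LLM math",
           "Duplicate listing of the same report on math word problems.",
           date(2023, 6, 1), "en", True),
    Record("Old solver", "LLM mathematical reasoning baseline.",
           date(2019, 6, 1), "en", True),
    Record("Theory note", "Bounds for large language model chain-of-thought.",
           date(2024, 1, 1), "en", False),
]

kept = screen(SAMPLE)
```

Logging which rule removed each record, rather than only the survivors, would let readers audit the selection in the way the referee requests.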

Circularity Check

0 steps flagged

No circularity: literature survey organizes external results without self-referential derivations

This paper is a systematic review and synthesis of approximately 120 existing peer-reviewed studies and preprints on mathematical reasoning in LLMs. It introduces a taxonomy of datasets, analyzes architectures and strategies from the literature, compares metrics, and identifies recurring failure modes reported across those works. No original equations, fitted parameters, predictions, or derivations are presented that could reduce to the paper's own inputs by construction. The central claims rest on reporting and organizing findings from independent external sources rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The selection of studies is an acknowledged methodological choice but does not create circularity under the defined patterns, as the paper does not claim to derive new quantities from its own analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The survey's framework rests on assumptions about the representativeness of the selected literature and the utility of the proposed taxonomy without new empirical validation of that taxonomy.

axioms (1)
  • Domain assumption: The approximately 120 selected studies are representative of the broader field of mathematical reasoning in LLMs.
    Invoked when claiming to identify recurring failure modes and key research directions from the reviewed set.
invented entities (1)
  • Unified taxonomy of mathematical datasets (no independent evidence)
    Purpose: To distinguish pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across levels of reasoning complexity.
    Introduced as a new organizational structure in the survey.

pith-pipeline@v0.9.0 · 5772 in / 1229 out tokens · 43307 ms · 2026-05-20T05:14:48.628481+00:00 · methodology
