Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

Aamir Shahzad; Husnain Amjad; Mehwish Fatima; Raja Khurram Shahzad

arxiv: 2605.19723 · v1 · pith:X2F6UCDInew · submitted 2026-05-19 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-20 05:14 UTC · model grok-4.3

keywords: large language models · mathematical reasoning · benchmarks · evaluation metrics · reasoning architectures · survey · failure modes

The pith

A review of about 120 studies maps the progress and persistent gaps in large language models for mathematical reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a systematic survey of roughly 120 studies on how large language models handle mathematical reasoning. It organizes datasets into a taxonomy based on their roles in pretraining, fine-tuning, and evaluation at different levels of complexity. The authors analyze various architectures and strategies such as tool integration and verifier-guided approaches, then compare evaluation metrics to show the difference between final-answer accuracy and actual process verification. They identify repeated problems including unfaithful reasoning, biased benchmarks, and weak generalization to new problems. The work ends by pointing to needed advances in symbolic grounding and more reliable evaluation methods for trustworthy systems.

Core claim

Through its unified taxonomy of datasets and analysis of architectures and metrics, the paper establishes that current large language models show gains in final-answer accuracy on mathematical tasks yet frequently fail at faithful step-by-step reasoning, suffer from benchmark biases, and generalize poorly, requiring targeted improvements in symbolic integration and process-level verification.

What carries the argument

The unified analytical framework that classifies mathematical datasets by usage stage and reasoning complexity while comparing training strategies such as tool integration and verifier guidance.
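A two-axis classification of this kind can be sketched as a small data structure. This is a minimal illustration, not the paper's taxonomy: the three usage stages come from the abstract, while the `Complexity` levels, the `Dataset` class, the `group` helper, and every dataset name below are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Usage stages named in the abstract."""
    PRETRAINING = "pretraining corpus"
    SFT = "supervised fine-tuning resource"
    EVALUATION = "evaluation benchmark"


class Complexity(Enum):
    """Hypothetical complexity levels; the paper's own scale may differ."""
    ARITHMETIC = 1
    MULTI_STEP = 2
    COMPETITION = 3


@dataclass(frozen=True)
class Dataset:
    name: str
    stage: Stage
    complexity: Complexity


def group(datasets):
    """Index dataset names by their (stage, complexity) cell."""
    table = defaultdict(list)
    for d in datasets:
        table[(d.stage, d.complexity)].append(d.name)
    return dict(table)


# Placeholder entries, invented purely for illustration.
DATASETS = [
    Dataset("corpus-A", Stage.PRETRAINING, Complexity.MULTI_STEP),
    Dataset("sft-set-B", Stage.SFT, Complexity.MULTI_STEP),
    Dataset("benchmark-C", Stage.EVALUATION, Complexity.ARITHMETIC),
    Dataset("benchmark-D", Stage.EVALUATION, Complexity.COMPETITION),
    Dataset("benchmark-E", Stage.EVALUATION, Complexity.COMPETITION),
]

TABLE = group(DATASETS)
for (stage, level), names in TABLE.items():
    print(f"{stage.value:32} {level.name:12} {', '.join(names)}")
```

Keying the table on the pair of axes makes empty cells easy to spot, which is one way such a taxonomy could reveal where benchmarks are missing.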

If this is right

Metrics focused on process verification rather than final answers would expose more accurate pictures of model capability.
Architectures that incorporate tools or verifiers improve robustness compared with standard fine-tuning alone.
Benchmark biases must be reduced before performance claims can be trusted across different problem distributions.
Greater emphasis on symbolic grounding would help close the gap between surface accuracy and reliable reasoning.
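The gap between final-answer accuracy and process verification can be made concrete with a toy scorer. The sketch below is an illustration under stated assumptions, not a metric taken from the paper: reasoning steps are assumed to be plain arithmetic equations of the form `lhs = rhs`, and the names `evaluate`, `step_is_valid`, `final_answer_accuracy`, `process_accuracy`, and the sample data are all hypothetical.

```python
import ast
import operator
from fractions import Fraction

# Arithmetic operators the checker is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> Fraction:
    """Evaluate a +, -, *, / expression exactly, without calling eval()."""
    def walk(node: ast.AST) -> Fraction:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return Fraction(str(node.value))
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr.strip(), mode="eval").body)


def step_is_valid(step: str) -> bool:
    """A step 'lhs = rhs' is valid when both sides evaluate to the same number."""
    lhs, sep, rhs = step.partition("=")
    if not sep:
        return False
    try:
        return evaluate(lhs) == evaluate(rhs)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False


def final_answer_accuracy(samples) -> float:
    """Share of samples whose predicted answer equals the gold answer."""
    return sum(s["predicted"] == s["gold"] for s in samples) / len(samples)


def process_accuracy(samples) -> float:
    """Share of samples with a correct answer AND every step valid."""
    ok = sum(
        s["predicted"] == s["gold"] and all(step_is_valid(t) for t in s["steps"])
        for s in samples
    )
    return ok / len(samples)


# Hypothetical model outputs, invented for illustration.
SAMPLES = [
    # Correct answer, every step checks out.
    {"steps": ["5 * 4 = 20", "20 - 20 + 10 = 10", "10 / 2 = 5"],
     "predicted": "5", "gold": "5"},
    # Correct answer reached through an invalid intermediate step.
    {"steps": ["3 + 4 = 8", "8 - 1 = 7"], "predicted": "7", "gold": "7"},
    # Wrong answer.
    {"steps": ["6 * 7 = 42"], "predicted": "41", "gold": "42"},
]

print("final-answer accuracy:", final_answer_accuracy(SAMPLES))
print("process-level accuracy:", process_accuracy(SAMPLES))
```

On these samples the second trace is counted as a success by the final-answer metric and as a failure by the process-level one, which is the discrepancy the list above describes. Real reasoning traces are free-form text, so a practical checker would need a far more capable step parser or a learned verifier.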

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same faithfulness and generalization issues likely appear in non-mathematical reasoning domains such as logical inference or scientific hypothesis generation.
The taxonomy could serve as a template for creating new evaluation sets that deliberately test for reasoning faithfulness across varying difficulty levels.
Developers might prioritize hybrid systems that combine language models with external symbolic solvers to address the limitations identified here.

Load-bearing premise

The selection of roughly 120 studies captures the main patterns in the field without major omissions or selection bias that would hide contradictory results.

What would settle it

A controlled study showing large language models that produce correct mathematical answers through fully traceable and faithful reasoning steps on a wide range of previously unseen problem types would contradict the reported recurring failure modes.

Figures

Figures reproduced from arXiv: 2605.19723 by Aamir Shahzad, Husnain Amjad, Mehwish Fatima, Raja Khurram Shahzad.

**Figure 1.** Top: Math word problem. Bottom: Step-by-step erroneous solution. Input Question: Dane’s two daughters need to plant a certain number of flowers each to grow a garden. As the days passed, the flowers grew into 20 more but 10 of them died. Dane’s daughters harvested the flowers and split them between 5 different baskets, with 4 flowers in each basket. How many flowers did each daughter plant initially? Answe…

**Figure 2.** Conceptual landscape of research on mathematical reasoning in large language…

**Figure 3.** PRISMA flow diagram of the systematic literature review selection process

**Figure 4.** Challenge pipeline summarizing the interconnected limitations affecting math…
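For orientation, the Figure 1 question can be worked as a single linear equation. This is a sketch under assumptions the truncated caption does not confirm: $x$ denotes the number of flowers each daughter planted, "grew into 20 more" is read as 20 additional flowers, and the harvest of 5 baskets with 4 flowers each accounts for every surviving flower.

```latex
\begin{align*}
2x + 20 - 10 &= 5 \times 4 \\
2x + 10 &= 20 \\
x &= 5
\end{align*}
```

Under that reading each daughter planted 5 flowers. The figure's own solution trace is cut off in the caption, so this derivation is not a reconstruction of it.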

Original abstract

Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a straightforward survey that organizes existing LLM math reasoning work into a taxonomy and flags common gaps, but its synthesis of failure modes stands or falls on how the 120 papers were picked.

This survey pulls together work on how large language models handle mathematical reasoning. It mainly organizes datasets, architectures, and evaluation issues rather than adding fresh experiments or proofs.

The central contribution is a unified taxonomy that splits datasets into pretraining corpora, supervised fine-tuning sets, and benchmarks at different complexity levels, plus a review of strategies like tool integration and verifier-guided reasoning. That structure can make the scattered literature easier to navigate for someone new to the area. The discussion of metric shortcomings, where final-answer accuracy often outpaces checks on actual reasoning steps, also lines up with points raised elsewhere. The paper does a reasonable job of collecting these threads in one place without overclaiming novelty.

The softer spot is the claim that the review of roughly 120 studies reliably surfaces recurring problems such as reasoning faithfulness failures and benchmark biases. The abstract gives no search strings, date cutoffs, or explicit rules for handling papers that contradict the main patterns, so it is difficult to judge whether the identified issues reflect the broader literature or the authors' selection. If the full text spells out those criteria and shows balanced coverage, that concern shrinks; otherwise the synthesis risks looking curated.

This paper suits researchers who want an entry point or a map of open directions like better symbolic grounding, rather than readers seeking new methods or tight empirical results. It is not groundbreaking but can save time for people scanning the field. I would send it to peer review so referees can check the review methodology and whether the taxonomy actually clarifies more than it restates.

Referee Report

1 major / 1 minor

Summary. The paper is a survey synthesizing advancements in mathematical reasoning for LLMs. It reviews approximately 120 studies on datasets, architectures, training strategies (including tool integration and verifier-guided reasoning), and evaluation protocols; introduces a unified taxonomy distinguishing pretraining corpora, supervised fine-tuning resources, and benchmarks by reasoning complexity; compares metrics to highlight gaps between final-answer accuracy and process-level verification; identifies recurring failure modes such as reasoning faithfulness issues, benchmark biases, and generalization limitations; and outlines future directions for symbolic grounding and trustworthy systems.

Significance. If the corpus selection proves representative and the taxonomy robust, the work supplies a consolidated analytical framework that organizes disparate findings, clarifies progress versus limitations, and could serve as a reference for researchers working on LLM reasoning benchmarks and architectures.

major comments (1)

[Abstract and Systematic Review section] The central claim of reliably identifying recurring failure modes (reasoning faithfulness, benchmark biases) rests on the systematic review of ~120 studies, yet the manuscript provides no search strings, inclusion/exclusion criteria, date ranges, or explicit protocol for handling contradictory papers (see Abstract and the section describing the review process). This omission makes it impossible to assess selection bias or confirm that the synthesized patterns reflect the literature distribution rather than curation choices.

minor comments (1)

[Taxonomy section] The unified taxonomy of mathematical datasets would be clearer with explicit examples or a table contrasting pretraining corpora, SFT resources, and evaluation benchmarks at different complexity levels.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback on our survey manuscript. We have carefully reviewed the major comment and provide a point-by-point response below. We agree that greater methodological transparency is warranted and will revise accordingly.

Referee: [Abstract and Systematic Review section] The central claim of reliably identifying recurring failure modes (reasoning faithfulness, benchmark biases) rests on the systematic review of ~120 studies, yet the manuscript provides no search strings, inclusion/exclusion criteria, date ranges, or explicit protocol for handling contradictory papers (see Abstract and the section describing the review process). This omission makes it impossible to assess selection bias or confirm that the synthesized patterns reflect the literature distribution rather than curation choices.

Authors: We acknowledge that this observation is correct and that the manuscript would benefit from explicit documentation of the review process. While the abstract and relevant section describe the scope as encompassing approximately 120 studies, they do not detail the search strategy, criteria, or handling of conflicting results. In the revised manuscript we will add a new subsection titled 'Review Methodology' immediately following the introduction of the taxonomy. This subsection will specify: search databases (arXiv, ACL Anthology, NeurIPS/ICLR proceedings, Google Scholar), keywords and Boolean search strings (e.g., (LLM OR 'large language model') AND ('mathematical reasoning' OR 'math word problems' OR 'chain-of-thought')), date range (primarily January 2020 through submission date), inclusion criteria (peer-reviewed or high-quality preprints with empirical LLM evaluations on mathematical tasks), exclusion criteria (non-English works, purely theoretical papers without experiments, duplicate reports), and our approach to contradictory findings (prioritizing recent rigorous evaluations while explicitly noting and discussing divergent results in the failure-modes section). These additions will allow readers to better evaluate potential curation effects without changing the core synthesis or taxonomy.

Revision: yes
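The screening protocol described in the rebuttal can be sketched as a small filter. This is a minimal illustration, assuming records arrive with title, abstract, date, and language metadata; the `Record` fields, the `has_experiments` flag, and the `screen` function are hypothetical, and the end date is taken from the submission date shown in the page header. Substring matching is a crude stand-in for a real database query.

```python
from dataclasses import dataclass
from datetime import date

# Term groups mirroring the example query in the rebuttal:
# (LLM OR 'large language model') AND
# ('mathematical reasoning' OR 'math word problems' OR 'chain-of-thought')
MODEL_TERMS = ("llm", "large language model")
TASK_TERMS = ("mathematical reasoning", "math word problems", "chain-of-thought")


@dataclass(frozen=True)
class Record:
    title: str
    abstract: str
    published: date
    language: str          # ISO 639-1 code, e.g. "en"
    has_experiments: bool  # hypothetical flag set during manual screening


def matches_query(record: Record) -> bool:
    """Apply the Boolean query to the lowercased title and abstract."""
    text = f"{record.title} {record.abstract}".lower()
    return any(t in text for t in MODEL_TERMS) and any(t in text for t in TASK_TERMS)


def screen(records, start: date, end: date):
    """Return records that pass every inclusion and exclusion rule."""
    seen = set()
    kept = []
    for r in records:
        key = r.title.strip().lower()
        if key in seen:
            continue  # exclusion: duplicate report
        seen.add(key)
        if r.language != "en":
            continue  # exclusion: non-English work
        if not r.has_experiments:
            continue  # exclusion: purely theoretical paper
        if not (start <= r.published <= end):
            continue  # outside the stated date range
        if matches_query(r):
            kept.append(r)
    return kept


# Invented records, for illustration only.
RECORDS = [
    Record("Verifier-guided LLM math", "We study mathematical reasoning.",
           date(2023, 6, 1), "en", True),
    Record("Verifier-guided LLM math", "We study mathematical reasoning.",
           date(2023, 6, 1), "en", True),
    Record("Large language model poetry", "No task overlap.",
           date(2023, 6, 1), "en", True),
    Record("Early LLM chain-of-thought notes", "Chain-of-thought prompts.",
           date(2018, 1, 1), "en", True),
    Record("LLM Mathe", "mathematical reasoning", date(2023, 6, 1), "de", True),
    Record("LLM theory", "mathematical reasoning bounds", date(2023, 6, 1), "en", False),
]

KEPT = screen(RECORDS, start=date(2020, 1, 1), end=date(2026, 5, 19))
print([r.title for r in KEPT])
```

Logging the count dropped at each rule, rather than silently skipping, would yield the numbers a PRISMA-style flow diagram reports.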

Circularity Check

0 steps flagged

No circularity: literature survey organizes external results without self-referential derivations

This paper is a systematic review and synthesis of approximately 120 existing peer-reviewed studies and preprints on mathematical reasoning in LLMs. It introduces a taxonomy of datasets, analyzes architectures and strategies from the literature, compares metrics, and identifies recurring failure modes reported across those works. No original equations, fitted parameters, predictions, or derivations are presented that could reduce to the paper's own inputs by construction. The central claims rest on reporting and organizing findings from independent external sources rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The selection of studies is an acknowledged methodological choice but does not create circularity under the defined patterns, as the paper does not claim to derive new quantities from its own analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The survey's framework rests on assumptions about the representativeness of the selected literature and the utility of the proposed taxonomy without new empirical validation of that taxonomy.

axioms (1)

domain assumption: The approximately 120 selected studies are representative of the broader field of mathematical reasoning in LLMs.
Invoked when claiming to identify recurring failure modes and key research directions from the reviewed set.

invented entities (1)

Unified taxonomy of mathematical datasets (no independent evidence)
purpose: To distinguish pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across levels of reasoning complexity.
Introduced as a new organizational structure in the survey.

pith-pipeline@v0.9.0 · 5772 in / 1229 out tokens · 43307 ms · 2026-05-20T05:14:48.628481+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "Our systematic review encompasses approximately 120 peer-reviewed studies and preprints... unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification... recurring failure modes, such as reasoning faithfulness issues, benchmark biases"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

120 extracted references · 120 canonical work pages · 1 internal anchor

[1]

2025 , eprint=

Structured Prompting Enables More Robust Evaluation of Language Models , author=. 2025 , eprint=

work page 2025
[2]

Proceedings of the 34th International Conference on Machine Learning , pages =

Constrained Policy Optimization , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , volume =

work page 2017
[3]

Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop , month = mar, year =

Large Language Models for Mathematical Reasoning: Progresses and Challenges , author =. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop , month = mar, year =. doi:10.18653/v1/2024.eacl-srw.17 , pages =

work page doi:10.18653/v1/2024.eacl-srw.17 2024
[4]

Proceedings of the 24th Interaction Design and Children , pages =

Anton, Jacqueline and Cosentino, Giulia and Sharma, Kshitij and Gelsomini, Mirko and Mok, Micah and Giannakos, Michail and Abrahamson, Dor , title =. Proceedings of the 24th Interaction Design and Children , pages =. 2025 , isbn =

work page 2025
[5]

2003 , publisher=

Mathematical Markup Language (MathML) Version 2.0 , author=. 2003 , publisher=

work page 2003
[6]

Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset , url =

Alavi Naeini, Saeid and Saqur, Raeid and Saeidi, Mozhgan and Giorgi, John and Taati, Babak , booktitle =. Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset , url =

work page
[7]

2023 , eprint=

ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics , author=. 2023 , eprint=

work page 2023
[8]

2020 , journal =

Byte Pair Encoding is Suboptimal for Language Model Pretraining , author =. Findings of the Association for Computational Linguistics: EMNLP 2020 , month = nov, year =. doi:10.18653/v1/2020.findings-emnlp.414 , pages =

work page doi:10.18653/v1/2020.findings-emnlp.414 2020
[9]

Language Models are Few-Shot Learners , url =

Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and others , booktitle =. Language Models are Few-Shot Learners , url =

work page
[10]

The Privacy Onion Effect: Memorization is Relative , url =

Carlini, Nicholas and Jagielski, Matthew and Zhang, Chiyuan and Papernot, Nicolas and Terzis, Andreas and Tramer, Florian , booktitle =. The Privacy Onion Effect: Memorization is Relative , url =

work page
[11]

Large Language Models are few(1)-shot Table Reasoners

Large Language Models are few(1)-shot Table Reasoners , author =. Findings of the Association for Computational Linguistics: EACL 2023 , month = may, year =. doi:10.18653/v1/2023.findings-eacl.83 , pages =

work page doi:10.18653/v1/2023.findings-eacl.83 2023
[12]

2025 , address =

Chernyshev, Konstantin and Polshkov, Vitaliy and Stepanov, Vlad and Myasnikov, Alex and Artemova, Ekaterina and Miasnikov, Alexei and Tilga, Sergei , booktitle =. 2025 , address =

work page 2025
[13]

Journal of Machine Learning Research , year =

Aakanksha Chowdhery and Sharan Narang and Jacob Devlin and Maarten Bosma and Gaurav Mishra and Adam Roberts and Paul Barham and Hyung Won Chung and Charles Sutton and Sebastian Gehrmann and Parker Schuh and others , title =. Journal of Machine Learning Research , year =

work page
[14]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

work page 2021
[15]

QLoRA: Efficient Finetuning of Quantized LLMs , url =

Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke , booktitle =. QLoRA: Efficient Finetuning of Quantized LLMs , url =

work page
[16]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , booktitle =. 2019 , address =. doi:10.18653/v1/N19-1423 , pages =

work page doi:10.18653/v1/n19-1423 2019
[17]

Nature , volume=

The language of mathematics: making the invisible visible , author=. Nature , volume=. 1998 , publisher=

work page 1998
[18]

Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving , url =

Didolkar, Aniket and Goyal, Anirudh and Ke, Nan Rosemary and Guo, Siyuan and Valko, Michal and Lillicrap, Timothy and Rezende, Danilo and Bengio, Yoshua and Mozer, Michael and Arora, Sanjeev , booktitle =. Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving , url =. doi:10.52202/079017-0623 , pages =

work page doi:10.52202/079017-0623
[19]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , month = dec, year =

Sparse Low-rank Adaptation of Pre-trained Language Models , author =. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , month = dec, year =. doi:10.18653/v1/2023.emnlp-main.252 , pages =

work page doi:10.18653/v1/2023.emnlp-main.252 2023
[20]

2024 , address =

Dou, Shihan and Zhou, Enyu and Liu, Yan and Gao, Songyang and Shen, Wei and Xiong, Limao and Zhou, Yuhao and Wang, Xiao and Xi, Zhiheng and Fan, Xiaoran and others , booktitle =. 2024 , address =. doi:10.18653/v1/2024.acl-long.106 , pages =

work page doi:10.18653/v1/2024.acl-long.106 2024
[21]

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts , year =

Duan, Nan and Tang, Duyu and Zhou, Ming , title =. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts , year =. doi:10.18653/v1/2020.emnlp-tutorials.1 , url =

work page doi:10.18653/v1/2020.emnlp-tutorials.1 2020
[22]

doi:10.1038/s41597-025-05283-3 , url =

Fang, Meng and Wan, Xiangpeng and Lu, Fei and Xing, Fei and Zou, Kai , date =. doi:10.1038/s41597-025-05283-3 , url =

work page doi:10.1038/s41597-025-05283-3
[23]

1963 , pages =

Computers and Thought , publisher =. 1963 , pages =

work page 1963
[24]

Polylogarithmic-time deterministic network decomposition and distributed derandomization , booktitle =

Feldman, Vitaly , title =. 2020 , isbn =. doi:10.1145/3357713.3384290 , booktitle =

work page doi:10.1145/3357713.3384290 2020
[25]

2025 , eprint=

A Survey on Mathematical Reasoning and Optimization with Large Language Models , author=. 2025 , eprint=

work page 2025
[26]

2025 , school =

Improving Complex Reasoning in Large Language Models , author =. 2025 , school =. doi:10.7488/era/6083 , url =

work page doi:10.7488/era/6083 2025
[27]

2026 , eprint=

Reward Shaping to Mitigate Reward Hacking in RLHF , author=. 2026 , eprint=

work page 2026
[28]

2024 , eprint=

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models , author=. 2024 , eprint=

work page 2024
[29]

NeurIPS 2023 AI for Science Workshop , year=

xVal: A Continuous Number Encoding for Large Language Models , author=. NeurIPS 2023 AI for Science Workshop , year=

work page 2023
[30]

A survey on dataset quality in machine learning , journal =

Youdi Gong and Guangzhen Liu and Yunzhi Xue and Rui Li and Lingzhong Meng , keywords =. A survey on dataset quality in machine learning , journal =. 2023 , issn =. doi:https://doi.org/10.1016/j.infsof.2023.107268 , url =

work page doi:10.1016/j.infsof.2023.107268 2023
[31]

2025 , eprint=

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking , author=. 2025 , eprint=

work page 2025
[32]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year =

Reward Reasoning Models , author =. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year =

work page
[33]

2024 , isbn =

Han, Zhiguang and Wang, Zijian , title =. 2024 , isbn =. doi:10.1145/3688864.3689149 , booktitle =

work page doi:10.1145/3688864.3689149 2024
[34]

O lympiad B ench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

He, Chaoqun and Luo, Renjie and Bai, Yuzhuo and Hu, Shengding and Thai, Zhen and Shen, Junhao and Hu, Jinyi and Han, Xu and Huang, Yujie and others , booktitle =. 2024 , address =. doi:10.18653/v1/2024.acl-long.211 , pages =

work page doi:10.18653/v1/2024.acl-long.211 2024
[35]

2021 , eprint=

Measuring Mathematical Problem Solving With the MATH Dataset , author=. 2021 , eprint=

work page 2021
[36]

, author=

Challenges in Assessing Mathematical Reasoning. , author=. Mathematics Education Research Group of Australasia , year=

work page
[37]

Australian Journal of Teacher Education , volume =

Herbert, Sandra , title =. Australian Journal of Teacher Education , volume =. 2021 , doi =

work page 2021
[38]

2021 , eprint=

Scaling Laws for Transfer , author=. 2021 , eprint=

work page 2021
[39]

An empirical analysis of compute-optimal large language model training , url =

Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks and others , booktitle =. An empirical analysis of compute-optimal large language model training , url =

work page
[40]

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (

Learning to Solve Arithmetic Word Problems with Verb Categorization , author =. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (. 2014 , address =. doi:10.3115/v1/D14-1058 , pages =

work page doi:10.3115/v1/d14-1058 2014
[41]

2021 , eprint=

LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

work page 2021
[42]

Findings of the Association for Computational Linguistics: ACL 2023 , month = jul, year =

Towards Reasoning in Large Language Models: A Survey , author =. Findings of the Association for Computational Linguistics: ACL 2023 , month = jul, year =. doi:10.18653/v1/2023.findings-acl.67 , pages =

work page doi:10.18653/v1/2023.findings-acl.67 2023
[43]

2025 , eprint=

MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations , author=. 2025 , eprint=

work page 2025
[44]

2024 , eprint=

O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? , author=. 2024 , eprint=

work page 2024
[45]

M ath P rompter: Mathematical reasoning using large language models

Imani, Shima and Du, Liang and Shrivastava, Harsh , booktitle =. 2023 , address =. doi:10.18653/v1/2023.acl-industry.4 , pages =

work page doi:10.18653/v1/2023.acl-industry.4 2023
[46]

Survey of Hallucination in Natural Language Generation

Ji, Ziwei and Lee, Nayeon and Frieske, Rita and Yu, Tiezheng and Su, Dan and Xu, Yan and Ishii, Etsuko and Bang, Ye Jin and Madotto, Andrea and Fung, Pascale , title =. 2023 , issue_date =. doi:10.1145/3571730 , journal =

work page doi:10.1145/3571730 2023
[47]

2025 , eprint=

MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion , author=. 2025 , eprint=

work page 2025
[48]

2020 , eprint=

Scaling Laws for Neural Language Models , author=. 2020 , eprint=

work page 2020
[49]

Intelligent Automation & Soft Computing , publisher =

Karra, Rachid and Lasfar, Abdelali , title =. Intelligent Automation & Soft Computing , publisher =. 2023 , doi =

work page 2023
[50]

1990 , isbn =

Kline, Morris , title =. 1990 , isbn =

work page 1990
[51]

MAWPS : A math word problem repository

Koncel-Kedziorski, Rik and Roy, Subhro and Amini, Aida and Kushman, Nate and Hajishirzi, Hannaneh , booktitle =. 2016 , address =. doi:10.18653/v1/N16-1136 , pages =

work page doi:10.18653/v1/n16-1136 2016
[52]

Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies , year=

MCAT Math Retrieval System for NTCIR-12 MathIR Task , author=. Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies , year=

work page
[53]

2022 , issue_date =

Kukreja, Vinay and Sakshi , title =. 2022 , issue_date =. doi:10.1007/s11042-022-12644-2 , journal =

work page doi:10.1007/s11042-022-12644-2 2022
[54]

International Conference on Learning Representations , year=

Deep Learning For Symbolic Mathematics , author=. International Conference on Learning Representations , year=

work page
[55]

Solving Quantitative Reasoning Problems with Language Models , url =

Lewkowycz, Aitor and Andreassen, Anders and Dohan, David and Dyer, Ethan and Michalewski, Henryk and Ramasesh, Vinay and Slone, Ambrose and Anil, Cem and others , booktitle =. Solving Quantitative Reasoning Problems with Language Models , url =

work page
[56]

2025 , isbn =

Li, Cheng and Fei, Xiaoyu and Yang, Xiaoyu , title =. 2025 , isbn =. doi:10.1145/3746709.3746759 , booktitle =

work page doi:10.1145/3746709.3746759 2025
[57]

CAMEL: Communicative Agents for

Li, Guohao and Hammoud, Hasan and Itani, Hani and Khizbullin, Dmitrii and Ghanem, Bernard , booktitle =. CAMEL: Communicative Agents for

work page
[58]

Enhancing Mathematical Problem Solving in Large Language Models through Tool-Integrated Reasoning and Python Code Execution , year=

Li, Siyue , booktitle=. Enhancing Mathematical Problem Solving in Large Language Models through Tool-Integrated Reasoning and Python Code Execution , year=

work page
[59]

2023 , eprint=

Label Supervised LLaMA Finetuning , author=. 2023 , eprint=

work page 2023
[60]

Authorea Preprints , year=

Low-Rank Adaptation for Scalable Large Language Models: A Comprehensive Survey , author=. Authorea Preprints , year=

work page
[61]

Transformer Circuits Thread , url=

On the biology of a large language model (2025) , author=. Transformer Circuits Thread , url=

work page 2025
[62]

DeepSeek-V3 Technical Report

Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=. 2412.19437 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[63]

2023 , isbn =

Liu, Jiayu and Huang, Zhenya and Ma, Zhiyuan and Liu, Qi and Chen, Enhong and Su, Tianhuang and Liu, Haifeng , title =. 2023 , isbn =. doi:10.1145/3580305.3599375 , booktitle =

work page doi:10.1145/3580305.3599375 2023
[64]

International Conference on Machine Learning , year=

DoRA: Weight-Decomposed Low-Rank Adaptation , author=. International Conference on Machine Learning , year=

work page
[65]

2025 , issue_date =

Liu, Wentao and Hu, Hanglei and Zhou, Jie and Ding, Yuyang and Li, Junsong and Zeng, Jiayi and He, Mengliang and Chen, Qin and Jiang, Bo and Zhou, Aimin and He, Liang , title =. 2025 , issue_date =. doi:10.1145/3773985 , journal =

work page doi:10.1145/3773985 2025
[66] Entity-Based Knowledge Conflicts in Question Answering. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, November 2021. doi:10.18653/v1/2021.emnlp-main.565.
[67] Lu, Pan; Gong, Ran; Jiang, Shibiao; Qiu, Liang; Huang, Siyuan; Liang, Xiaodan; Zhu, Song-Chun. 2021. doi:10.18653/v1/2021.acl-long.528.
[68] Mansouri, Behrooz; Rohatgi, Shaurya; Oard, Douglas W.; Wu, Jian; Giles, C. Lee; Zanibbi, Richard. Proceedings of the 2019 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '19). doi:10.1145/3341981.3344235.
[69] Matzakos, Nikolaos; Doukakis, Spyridon; Moundridou, Maria. International Journal of Emerging Technologies in Learning (iJET), 2023.
[70] Miao, Shen-yun; Liang, Chao-Chun; Su, Keh-Yih. A Diverse Corpus for Evaluating and Developing. 2020. doi:10.18653/v1/2020.acl-main.92.
[71] Miao, Yuchun; Zhang, Sen; Ding, Liang; Bao, Rong; Zhang, Lefei; Tao, Dacheng. InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling. doi:10.52202/079017-4270.
[72] Min, Sewon; Lewis, Mike; Zettlemoyer, Luke; Hajishirzi, Hannaneh. 2022. doi:10.18653/v1/2022.naacl-main.201.
[73] Mishra, Swaroop; Finlayson, Matthew; Lu, Pan; Tang, Leonard; Welleck, Sean; Baral, Chitta; Rajpurohit, Tanmay; Tafjord, Oyvind; Sabharwal, Ashish; Clark, Peter; Kalyan, Ashwin. 2022. doi:10.18653/v1/2022.emnlp-main.392.
[74] Mu, Tong; Helyar, Alec; Heidecke, Johannes; Achiam, Joshua; Vallone, Andrea; Kivlichan, Ian; Lin, Molly; Beutel, Alex; Schulman, John; Weng, Lilian. Rule Based Rewards for Language Model Safety. doi:10.52202/079017-3457.
[75] Investigating Symbolic Capabilities of Large Language Models. Proceedings of the First International Workshop on Logical Foundations of Neuro-Symbolic AI (LNSAI 2024), 2024.
[76] Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama; et al. Training language models to follow instructions with human feedback.
[77] Learning from Few Examples: A Summary of Approaches to Few-Shot Learning. E-print, 2022.
[78] OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text. E-print, 2023.
[79] MathBERT: A Pre-Trained Model for Mathematical Formula Understanding. E-print, 2021.
[80] Pourpanah, Farhad; Abdar, Moloud; Luo, Yuxuan; Zhou, Xinlei; Wang, Ran; Lim, Chee Peng; Wang, Xi-Zhao; Wu, Q. M. Jonathan. A Review of Generalized Zero-Shot Learning Methods.


