Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges
Pith reviewed 2026-05-20 05:14 UTC · model grok-4.3
The pith
A review of about 120 studies maps the progress and persistent gaps in large language models for mathematical reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through its unified taxonomy of datasets and its analysis of architectures and metrics, the paper concludes that current large language models have gained in final-answer accuracy on mathematical tasks, yet they frequently fail at faithful step-by-step reasoning, suffer from benchmark biases, and generalize poorly. Closing these gaps requires targeted improvements in symbolic integration and process-level verification.
What carries the argument
The unified analytical framework that classifies mathematical datasets by usage stage and reasoning complexity while comparing training strategies such as tool integration and verifier guidance.
If this is right
- Metrics focused on process verification rather than final answers would expose more accurate pictures of model capability.
- Architectures that incorporate tools or verifiers improve robustness compared with standard fine-tuning alone.
- Benchmark biases must be reduced before performance claims can be trusted across different problem distributions.
- Greater emphasis on symbolic grounding would help close the gap between surface accuracy and reliable reasoning.
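To make the contrast between final-answer accuracy and process-level verification concrete, here is a minimal Python sketch. It assumes a hypothetical toy solution format in which every step is a single binary arithmetic claim `lhs op rhs = result`, checked exactly with `fractions.Fraction`; the class and function names are illustrative and do not come from the surveyed paper. A solution whose errors happen to cancel earns full credit under final-answer accuracy but fails the step check.

```python
import operator
from dataclasses import dataclass
from fractions import Fraction

# Hypothetical step format: one binary arithmetic claim "lhs op rhs = result".
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


@dataclass
class Step:
    lhs: Fraction
    op: str
    rhs: Fraction
    result: Fraction

    def is_valid(self) -> bool:
        """Recompute the step exactly and compare it with the claimed result."""
        return OPS[self.op](self.lhs, self.rhs) == self.result


@dataclass
class Solution:
    steps: list
    final_answer: Fraction


def final_answer_accuracy(solutions, gold):
    """Share of solutions whose final answer matches the reference answer."""
    hits = sum(s.final_answer == g for s, g in zip(solutions, gold))
    return hits / len(solutions)


def process_accuracy(solutions, gold):
    """Share of solutions with a correct final answer AND every step valid."""
    hits = sum(
        s.final_answer == g and all(step.is_valid() for step in s.steps)
        for s, g in zip(solutions, gold)
    )
    return hits / len(solutions)


# Toy problem: 3 items at 4 each plus 5 shipping; reference answer 17.
F = Fraction
faithful = Solution([Step(F(3), "*", F(4), F(12)), Step(F(12), "+", F(5), F(17))], F(17))
# Two compensating errors: 3 * 4 is misstated as 13, then shipping is misread as 4.
lucky = Solution([Step(F(3), "*", F(4), F(13)), Step(F(13), "+", F(4), F(17))], F(17))
gold = [F(17), F(17)]

print(final_answer_accuracy([faithful, lucky], gold))  # 1.0
print(process_accuracy([faithful, lucky], gold))       # 0.5
```

This checker only catches arithmetic slips. The second step of `lucky` (`13 + 4 = 17`) is arithmetically valid even though it uses the wrong shipping cost, so detecting that kind of unfaithful step would need a stronger verifier than exact recomputation.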
Where Pith is reading between the lines
- The same faithfulness and generalization issues likely appear in non-mathematical reasoning domains such as logical inference or scientific hypothesis generation.
- The taxonomy could serve as a template for creating new evaluation sets that deliberately test for reasoning faithfulness across varying difficulty levels.
- Developers might prioritize hybrid systems that combine language models with external symbolic solvers to address the limitations identified here.
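The hybrid pattern in the last point can be illustrated with a minimal Python sketch of tool integration. It assumes that the model marks each calculation with a hypothetical `<<expr>>` delimiter; the harness parses each expression with Python's `ast` module, evaluates it exactly over a whitelist of arithmetic operators, and splices the computed value back into the text instead of trusting the number the model wrote. The restricted evaluator stands in for a real external symbolic solver such as a computer algebra system, and the delimiter and function names are illustrative assumptions.

```python
import ast
import operator
import re
from fractions import Fraction

# Whitelisted operators; any other syntax (calls, names, attributes) is rejected.
BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def _eval(node):
    """Recursively evaluate a parsed arithmetic expression with exact fractions."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return Fraction(str(node.value))  # str() keeps 1.5 as exactly 3/2
    if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
        return BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in UNARY_OPS:
        return UNARY_OPS[type(node.op)](_eval(node.operand))
    raise ValueError(f"unsupported expression: {ast.dump(node)}")


def evaluate(expr: str) -> Fraction:
    """Exactly evaluate a plain arithmetic expression without calling eval()."""
    return _eval(ast.parse(expr, mode="eval"))


# Hypothetical convention: the model wraps each calculation in <<...>>.
CALL = re.compile(r"<<(.+?)>>")


def run_tool_calls(model_output: str) -> str:
    """Replace every <<expr>> span with the tool-computed value."""
    return CALL.sub(lambda m: str(evaluate(m.group(1))), model_output)


draft = "3 items cost <<3 * 4>> dollars; with shipping the total is <<3 * 4 + 5>>."
print(run_tool_calls(draft))
# 3 items cost 12 dollars; with shipping the total is 17.
```

Walking the syntax tree rather than calling `eval()` keeps arbitrary code out of the tool path, and exact fractions avoid floating-point drift in multi-step arithmetic.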
Load-bearing premise
The selection of roughly 120 studies captures the main patterns in the field without major omissions or selection bias that would hide contradictory results.
What would settle it
A controlled study showing large language models that produce correct mathematical answers through fully traceable and faithful reasoning steps on a wide range of previously unseen problem types would contradict the reported recurring failure modes.
Original abstract
Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a survey synthesizing advancements in mathematical reasoning for LLMs. It reviews approximately 120 studies on datasets, architectures, training strategies (including tool integration and verifier-guided reasoning), and evaluation protocols; introduces a unified taxonomy distinguishing pretraining corpora, supervised fine-tuning resources, and benchmarks by reasoning complexity; compares metrics to highlight gaps between final-answer accuracy and process-level verification; identifies recurring failure modes such as reasoning faithfulness issues, benchmark biases, and generalization limitations; and outlines future directions for symbolic grounding and trustworthy systems.
Significance. If the corpus selection proves representative and the taxonomy robust, the work supplies a consolidated analytical framework that organizes disparate findings, clarifies progress versus limitations, and could serve as a reference for researchers working on LLM reasoning benchmarks and architectures.
major comments (1)
- [Abstract and Systematic Review section] The central claim of reliably identifying recurring failure modes (reasoning faithfulness, benchmark biases) rests on the systematic review of ~120 studies, yet the manuscript provides no search strings, inclusion/exclusion criteria, date ranges, or explicit protocol for handling contradictory papers (see Abstract and the section describing the review process). This omission makes it impossible to assess selection bias or confirm that the synthesized patterns reflect the literature distribution rather than curation choices.
minor comments (1)
- [Taxonomy section] The unified taxonomy of mathematical datasets would be clearer with explicit examples or a table contrasting pretraining corpora, SFT resources, and evaluation benchmarks at different complexity levels.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our survey manuscript. We have carefully reviewed the major comment and provide a point-by-point response below. We agree that greater methodological transparency is warranted and will revise accordingly.
- Referee: [Abstract and Systematic Review section] The central claim of reliably identifying recurring failure modes (reasoning faithfulness, benchmark biases) rests on the systematic review of ~120 studies, yet the manuscript provides no search strings, inclusion/exclusion criteria, date ranges, or explicit protocol for handling contradictory papers (see Abstract and the section describing the review process). This omission makes it impossible to assess selection bias or confirm that the synthesized patterns reflect the literature distribution rather than curation choices.
Authors: We acknowledge that this observation is correct and that the manuscript would benefit from explicit documentation of the review process. While the abstract and relevant section describe the scope as encompassing approximately 120 studies, they do not detail the search strategy, criteria, or handling of conflicting results. In the revised manuscript we will add a new subsection titled 'Review Methodology' immediately following the introduction of the taxonomy. This subsection will specify: search databases (arXiv, ACL Anthology, NeurIPS/ICLR proceedings, Google Scholar), keywords and Boolean search strings (e.g., (LLM OR 'large language model') AND ('mathematical reasoning' OR 'math word problems' OR 'chain-of-thought')), date range (primarily January 2020 through submission date), inclusion criteria (peer-reviewed or high-quality preprints with empirical LLM evaluations on mathematical tasks), exclusion criteria (non-English works, purely theoretical papers without experiments, duplicate reports), and our approach to contradictory findings (prioritizing recent rigorous evaluations while explicitly noting and discussing divergent results in the failure-modes section). These additions will allow readers to better evaluate potential curation effects without changing the core synthesis or taxonomy. revision: yes
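The search and screening steps proposed in this response can be written down as an explicit, rerunnable filter, which is the kind of artifact that lets readers check curation effects. The Python sketch below is illustrative only: the `Record` fields, the sample records, the case-insensitive substring matching, and the caller-supplied `end` date (standing in for "submission date") are all assumptions. It covers search and screening, not the handling of contradictory findings.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    title: str
    abstract: str
    published: date
    language: str
    has_experiments: bool


# (LLM OR 'large language model') AND ('mathematical reasoning' OR
# 'math word problems' OR 'chain-of-thought'), matched as lowercase substrings.
MODEL_TERMS = ("llm", "large language model")
TOPIC_TERMS = ("mathematical reasoning", "math word problems", "chain-of-thought")
START = date(2020, 1, 1)


def matches_query(record):
    """Apply the Boolean query to the title and abstract."""
    text = f"{record.title} {record.abstract}".lower()
    return any(t in text for t in MODEL_TERMS) and any(t in text for t in TOPIC_TERMS)


def screen(records, end):
    """Return records that pass the query, date range, and exclusion criteria."""
    seen = set()
    kept = []
    for record in records:
        key = (record.title.strip().lower(), record.published)
        if key in seen:
            continue  # exclusion: duplicate report
        seen.add(key)
        if not matches_query(record):
            continue
        if not START <= record.published <= end:
            continue  # outside the date range
        if record.language != "en":
            continue  # exclusion: non-English work
        if not record.has_experiments:
            continue  # exclusion: purely theoretical, no experiments
        kept.append(record)
    return kept


# Hypothetical candidate records, not real papers.
records = [
    Record("Paper A", "An LLM evaluated on math word problems.", date(2023, 5, 1), "en", True),
    Record("Paper A", "An LLM evaluated on math word problems.", date(2023, 5, 1), "en", True),
    Record("Paper B", "A theory of chain-of-thought in LLMs.", date(2024, 2, 1), "en", False),
    Record("Paper C", "Word problems solved by rule-based parsing.", date(2019, 6, 1), "en", True),
]
kept = screen(records, end=date(2026, 5, 1))
print([r.title for r in kept])  # ['Paper A']
```

Logging which rule rejected each record, rather than silently skipping it, would give the exclusion counts that a methodology subsection typically reports.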
Circularity Check
No circularity: literature survey organizes external results without self-referential derivations
Full rationale
This paper is a systematic review and synthesis of approximately 120 existing peer-reviewed studies and preprints on mathematical reasoning in LLMs. It introduces a taxonomy of datasets, analyzes architectures and strategies from the literature, compares metrics, and identifies recurring failure modes reported across those works. No original equations, fitted parameters, predictions, or derivations are presented that could reduce to the paper's own inputs by construction. The central claims rest on reporting and organizing findings from independent external sources rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The selection of studies is an acknowledged methodological choice but does not create circularity under the defined patterns, as the paper does not claim to derive new quantities from its own analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The approximately 120 selected studies are representative of the broader field of mathematical reasoning in LLMs.
invented entities (1)
- Unified taxonomy of mathematical datasets (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction. Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "Our systematic review encompasses approximately 120 peer-reviewed studies and preprints... unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel. Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification... recurring failure modes, such as reasoning faithfulness issues, benchmark biases"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.