
arxiv: 2309.17452 · v4 · pith:2FFNLPMK · submitted 2023-09-29 · 💻 cs.CL · cs.AI

ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

Pith reviewed 2026-05-19 09:14 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords mathematical reasoning · tool integration · large language models · imitation learning · reasoning agents · MATH benchmark · hybrid reasoning

The pith

ToRA agents combine natural language reasoning with external tool calls, setting new accuracy levels for open models on complex math problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ToRA, a series of models that interleave natural language reasoning steps with calls to external tools such as computation libraries and symbolic solvers. Training relies on curating interactive trajectories of tool use on math datasets followed by imitation learning and output space shaping. The central claim is that this hybrid approach lets models achieve substantially higher accuracy on mathematical reasoning benchmarks than prior open-source systems of any size. A reader would care because it offers a concrete path past the calculation and verification limits that pure text-based models encounter on competition-level problems.

Core claim

By training on curated interactive tool-use trajectories and applying imitation learning with output shaping, ToRA models integrate natural language reasoning with external tools to outperform open-source baselines on ten mathematical reasoning datasets, delivering 13 to 19 percent absolute gains on average; ToRA-7B reaches 44.6 percent on the MATH dataset while ToRA-Code-34B exceeds 50 percent and surpasses GPT-4 chain-of-thought performance.

What carries the argument

The Tool-integrated Reasoning Agent that interleaves language reasoning steps with calls to external tools such as code interpreters and symbolic solvers.
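
The interleaving described here can be illustrated with a minimal sketch. This is not the authors' implementation: `scripted_model` is a hypothetical stand-in for a language model, the step labels (`rationale`, `program`, `output`, `answer`) are assumed names, and `run_program` uses plain `exec` with no sandboxing, which a real system would need to replace.

```python
import contextlib
import io


def run_program(code: str) -> str:
    """Run a code snippet and return what it printed, or the error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # illustration only: no sandboxing or time limit
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def scripted_model(history):
    """Hypothetical model stub: picks the next step from the history so far."""
    if not history:
        return ("rationale", "Summing 1 through 100 is a job for code.")
    last_kind, last_text = history[-1]
    if last_kind == "rationale":
        return ("program", "print(sum(range(1, 101)))")
    # The last step was a tool output, so report it as the final answer.
    return ("answer", last_text)


def solve(model, max_steps=6):
    """Alternate model steps with tool execution until an answer appears."""
    history = []
    for _ in range(max_steps):
        kind, text = model(history)
        history.append((kind, text))
        if kind == "answer":
            return text, history
        if kind == "program":
            # Feed the tool result back so the next step can condition on it.
            history.append(("output", run_program(text)))
    return None, history


answer, history = solve(scripted_model)
print(answer)
```

The point of the loop is that tool output is appended to the history, so later reasoning steps can depend on an exact computed value rather than on text the model generated itself.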

If this is right

  • Open models as small as 7B parameters can exceed the math performance of 70B models trained without tools.
  • An open-source model can reach above 50 percent accuracy on the MATH benchmark for the first time.
  • Hybrid reasoning that mixes text and tool calls becomes competitive with closed models using programs on the same tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trajectory-curation plus imitation approach could be tested on non-math reasoning domains that also benefit from precise external verification.
  • Future work might examine whether the learned tool-calling patterns remain effective when the underlying solver libraries are updated or replaced.
  • Scaling the method to larger base models or richer tool sets might further close the gap with frontier closed systems.

Load-bearing premise

The interactive tool-use trajectories collected during data curation supply high-quality supervision that imitation learning can generalize to unseen math problems.

What would settle it

Training a ToRA-style model on the same trajectories and measuring no accuracy gain over a strong baseline on a fresh set of competition math problems not seen during trajectory collection.

Original abstract

Large language models have made significant progress in various language tasks, yet they still struggle with complex mathematics. In this paper, we propose ToRA, a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems by seamlessly integrating natural language reasoning with the utilization of external tools (e.g., computation libraries and symbolic solvers), thereby amalgamating the analytical prowess of language and the computational efficiency of tools. To train ToRA, we curate interactive tool-use trajectories on mathematical datasets, apply imitation learning on the annotations, and propose output space shaping to further refine models' reasoning behavior. As a result, ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales with 13%-19% absolute improvements on average. Notably, ToRA-7B reaches 44.6% on the competition-level dataset MATH, surpassing the best open-source model WizardMath-70B by 22% absolute. ToRA-Code-34B is also the first open-source model that achieves an accuracy exceeding 50% on MATH, which significantly outperforms GPT-4's CoT result, and is competitive with GPT-4 solving problems with programs. Additionally, we conduct a comprehensive analysis of the benefits and remaining challenges of tool interaction for mathematical reasoning, providing valuable insights for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ToRA, a series of Tool-integrated Reasoning Agents that combine natural language reasoning with external tools such as computation libraries and symbolic solvers to solve complex mathematical problems. Training involves curating interactive tool-use trajectories on datasets like GSM8K and MATH, followed by imitation learning and output space shaping. The models demonstrate substantial gains over open-source baselines on 10 mathematical reasoning datasets: ToRA-7B achieves 44.6% accuracy on MATH (surpassing WizardMath-70B by 22% absolute), and ToRA-Code-34B exceeds 50% on MATH, outperforming GPT-4's chain-of-thought result and remaining competitive with GPT-4 solving problems with programs.

Significance. If the results hold and the improvements stem from robust tool integration rather than memorization of trajectory patterns, this would be a significant contribution to mathematical reasoning in LLMs by showing how tool use can be effectively integrated via imitation learning. The work highlights the potential for open-source models to approach or exceed proprietary model performance on competition-level math problems and provides analysis of tool interaction benefits and challenges.

major comments (2)
  1. [Section 3.2] The trajectory collection process using GPT-4 prompting on training splits, followed by filtering and output-space shaping, is described at a high level. Without an ablation that removes the specific tool-call format while preserving reasoning content, it remains unclear whether the reported gains (13-19% absolute on average across the ten datasets, and 22% absolute on MATH for ToRA-7B over WizardMath-70B) reflect generalizable tool integration or exploitation of recurring syntactic patterns in the curated trajectories.
  2. [Experimental results] Headline performance numbers (e.g., 44.6% and >50% on MATH) are reported without error bars, multiple random seeds, or full details on training hyperparameters and data splits, which limits assessment of the robustness and reproducibility of the central performance claims across the ten datasets.
minor comments (2)
  1. [Abstract] The statement that ToRA-Code-34B is 'the first open-source model' to exceed 50% on MATH should explicitly list the prior open-source models considered in the comparison.
  2. [Tables] Ensure all result tables include standard deviations or confidence intervals alongside accuracy metrics to support the cross-scale and cross-dataset claims.
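
The comments above ask for error bars. One part of that uncertainty, the sampling noise from a finite test set, can be estimated without retraining. The sketch below computes a Wilson score interval for a reported accuracy. The test-set size of 5,000 is an assumption about the MATH test split and is not stated on this page, and the interval says nothing about variance across random seeds or training runs.

```python
import math


def wilson_interval(accuracy: float, n: int, z: float = 1.96):
    """Wilson score interval for a proportion observed on n test items.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(
        accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)
    ) / denom
    return center - half_width, center + half_width


# Assumed test-set size; replace with the actual number of evaluated problems.
ASSUMED_MATH_TEST_SIZE = 5000

low, high = wilson_interval(0.446, ASSUMED_MATH_TEST_SIZE)
print(f"44.6% accuracy, approximate 95% interval: {low:.3f} to {high:.3f}")
```

Under that assumed size the interval spans roughly one to two percentage points on either side of the reported value, which is small relative to a 22-point gap but would matter when comparing models that differ by only a point or two.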

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the contributions of tool integration and strengthens the reporting of our results. We address each major comment below and describe the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Section 3.2] The trajectory collection process using GPT-4 prompting on training splits, followed by filtering and output-space shaping, is described at a high level. Without an ablation that removes the specific tool-call format while preserving reasoning content, it remains unclear whether the reported gains (13-19% absolute on average across the ten datasets, and 22% absolute on MATH for ToRA-7B over WizardMath-70B) reflect generalizable tool integration or exploitation of recurring syntactic patterns in the curated trajectories.

    Authors: We appreciate the referee's point on distinguishing tool integration from potential pattern memorization. Section 3.2 describes the GPT-4-based trajectory curation and output-space shaping, while Section 5 analyzes tool-use benefits through case studies and error breakdowns showing improved handling of computation and symbolic steps. To directly address the concern, we will add a new ablation in the revised manuscript: we will generate parallel trajectory sets that preserve reasoning content but modify or remove the specific tool-call syntax, then compare resulting model performance to isolate the contribution of the tool format. revision: yes

  2. Referee: [Experimental results] Headline performance numbers (e.g., 44.6% and >50% on MATH) are reported without error bars, multiple random seeds, or full details on training hyperparameters and data splits, which limits assessment of the robustness and reproducibility of the central performance claims across the ten datasets.

    Authors: We agree that expanded experimental details improve reproducibility. In the revision we will add full specifications of training hyperparameters, data splits, and implementation choices. For variance, we will report any repeated-run statistics we can obtain and discuss consistency of gains across scales and datasets. However, running multiple independent random seeds for every model size and all ten datasets is computationally prohibitive given the resources needed to train up to 34B models; we will explicitly note this limitation. revision: partial
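
The ablation proposed in the first response, keeping the reasoning content while changing or removing the tool-call syntax, could be prototyped along the following lines. This is a sketch under stated assumptions: the `(kind, text)` step representation and both delimiter styles are hypothetical and are not claimed to match the trajectory format used in the paper.

```python
# Hypothetical delimiter styles; neither is claimed to be the paper's format.
STYLES = {
    "tags": {"program": ("<code>", "</code>"), "output": ("<result>", "</result>")},
    "brackets": {"program": ("[[run]]", "[[/run]]"), "output": ("[[out]]", "[[/out]]")},
}


def render(steps, style):
    """Serialize (kind, text) steps, wrapping tool steps in the chosen delimiters."""
    delimiters = STYLES[style]
    lines = []
    for kind, text in steps:
        opening, closing = delimiters.get(kind, ("", ""))
        lines.append(f"{opening}{text}{closing}")
    return "\n".join(lines)


def strip_programs(steps):
    """Drop the program steps but keep rationales and tool outputs."""
    return [(kind, text) for kind, text in steps if kind != "program"]


trajectory = [
    ("rationale", "Multiply the two factors with code."),
    ("program", "print(6 * 7)"),
    ("output", "42"),
    ("rationale", "So the answer is 42."),
]

variant_a = render(trajectory, "tags")                  # original-style syntax
variant_b = render(trajectory, "brackets")              # same content, new syntax
variant_c = render(strip_programs(trajectory), "tags")  # programs removed
print(variant_b)
```

Training one model per variant and comparing accuracy would separate the effect of surface syntax (a versus b) from the effect of seeing the programs at all (a versus c).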

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

Full rationale

The paper presents an empirical method: curating tool-use trajectories via GPT-4 on training splits of public datasets, applying imitation learning, and reporting accuracy on held-out test sets of MATH, GSM8K and eight other standard benchmarks. All performance numbers (e.g., ToRA-7B at 44.6% on MATH) are direct measurements against independent external baselines such as WizardMath-70B and GPT-4. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the central claim is therefore a set of falsifiable experimental outcomes rather than a derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claims rest on the assumption that high-quality tool-use trajectories can be curated and that imitation learning on them produces generalizable reasoning improvements; no new physical entities are postulated.

free parameters (1)
  • model scale choices (7B, 34B)
    Selected to demonstrate scaling behavior and compare against existing open-source baselines of different sizes.
axioms (1)
  • domain assumption: Imitation learning on curated tool-use trajectories transfers to improved performance on unseen mathematical problems.
    Invoked when claiming that training on the collected trajectories yields the reported gains across datasets.

pith-pipeline@v0.9.0 · 5783 in / 1348 out tokens · 31781 ms · 2026-05-19T09:14:36.772008+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  2. FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

    cs.CV 2025-06 unverdicted novelty 7.0

    FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.

  3. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  4. When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.

  5. Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

    cs.AI 2025-10 unverdicted novelty 6.0

    TRACE is a reference-free multi-dimensional evaluation framework for tool-augmented LLM reasoning trajectories that uses an evidence bank and is validated on a new meta-evaluation dataset of flawed trajectories.

  6. Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

    cs.AI 2025-07 conditional novelty 6.0

    Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

  7. ToolRL: Reward is All Tool Learning Needs

    cs.LG 2025-04 conditional novelty 6.0

    A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

  8. Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

    cs.LG 2024-06 conditional novelty 6.0

    Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.

  9. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    cs.AI 2023-12 conditional novelty 6.0

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  10. LLMs with in-context learning for Algorithmic Theoretical Physics

    cs.LG 2026-05 unverdicted novelty 5.0

    Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.

  11. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  12. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  13. DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    cs.SE 2024-01 unverdicted novelty 5.0

    DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.

  14. Rethinking Wireless Communications through Formal Mathematical AI Reasoning

    eess.SP 2026-04 unverdicted novelty 4.0

    Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.

  15. Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation

    cs.CL 2026-04 unverdicted novelty 4.0

    AMR uses difficulty-aware routing and uncertainty-guided aggregation across three experts plus a neural verifier to reach 75.28% accuracy on GSM8K without synthetic training data.

  16. Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub

    cs.CL 2026-03 unverdicted novelty 4.0

    Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.

  17. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  18. A Survey on the Memory Mechanism of Large Language Model based Agents

    cs.AI 2024-04 accept novelty 3.0

    A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
