arxiv: 2309.17452 · v4 · pith:2FFNLPMK · submitted 2023-09-29 · 💻 cs.CL · cs.AI

ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen

Pith reviewed 2026-05-19 09:14 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords mathematical reasoning · tool integration · large language models · imitation learning · reasoning agents · MATH benchmark · hybrid reasoning

The pith

ToRA agents combine natural language reasoning with external tool calls, setting new accuracy levels for open-source models on complex math problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ToRA, a series of models that interleave natural language reasoning steps with calls to external tools such as computation libraries and symbolic solvers. Training relies on curating interactive trajectories of tool use on math datasets followed by imitation learning and output space shaping. The central claim is that this hybrid approach lets models achieve substantially higher accuracy on mathematical reasoning benchmarks than prior open-source systems of any size. A reader would care because it offers a concrete path past the calculation and verification limits that pure text-based models encounter on competition-level problems.

Core claim

By training on curated interactive tool-use trajectories and applying imitation learning with output shaping, ToRA models integrate natural language reasoning with external tools to outperform open-source baselines on ten mathematical reasoning datasets, delivering 13 to 19 percent absolute gains on average; ToRA-7B reaches 44.6 percent on the MATH dataset while ToRA-Code-34B exceeds 50 percent and surpasses GPT-4 chain-of-thought performance.

What carries the argument

The Tool-integrated Reasoning Agent that interleaves language reasoning steps with calls to external tools such as code interpreters and symbolic solvers.
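
The interleaving can be pictured with a toy loop. This is a minimal sketch, not the paper's implementation: the `<code>` and `<output>` delimiters, the `scripted_model` stand-in for a language model, and the `solve` and `run_tool` helpers are all hypothetical names chosen for illustration.

```python
import contextlib
import io
import re

# Hypothetical delimiters for a tool call; the paper's actual format may differ.
TOOL_CALL = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_tool(program: str) -> str:
    """Execute a Python snippet and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, {})  # toy sandbox only; never run untrusted code this way
    return buffer.getvalue().strip()


def scripted_model(context: str) -> str:
    """Stand-in for an LLM: emits a tool call first, then a final answer."""
    if "<output>" not in context:
        return ("The sum 1 + 2 + ... + 100 is easier to compute than to reason out.\n"
                "<code>print(sum(range(1, 101)))</code>")
    result = re.findall(r"<output>(.*?)</output>", context, re.DOTALL)[-1]
    return f"The tool returned {result}, so the answer is {result}."


def solve(question: str, model=scripted_model, max_rounds: int = 3) -> str:
    """Alternate between model text and tool execution until no tool call remains."""
    context = question
    for _ in range(max_rounds):
        step = model(context)
        context += "\n" + step
        call = TOOL_CALL.search(step)
        if call is None:  # pure reasoning step: treat it as the final answer
            return step
        # Feed the tool result back so the next reasoning step can use it.
        context += f"\n<output>{run_tool(call.group(1))}</output>"
    return context.splitlines()[-1]


final = solve("What is the sum of the integers from 1 to 100?")
print(final)
```

The point of the sketch is only the control flow: each round appends a reasoning step, executes any tool call it contains, and returns the tool output to the context before the next step.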

If this is right

Open models as small as 7B parameters can exceed the math performance of 70B models trained without tools.
An open-source model can reach above 50 percent accuracy on the MATH benchmark for the first time.
Hybrid reasoning that mixes text and tool calls becomes competitive with closed models using programs on the same tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trajectory-curation plus imitation approach could be tested on non-math reasoning domains that also benefit from precise external verification.
Future work might examine whether the learned tool-calling patterns remain effective when the underlying solver libraries are updated or replaced.
Scaling the method to larger base models or richer tool sets might further close the gap with frontier closed systems.

Load-bearing premise

The interactive tool-use trajectories collected during data curation supply high-quality supervision that imitation learning can generalize to unseen math problems.

What would settle it

Training a ToRA-style model on the same trajectories and measuring no accuracy gain over a strong baseline on a fresh set of competition math problems not seen during trajectory collection.

Original abstract

Large language models have made significant progress in various language tasks, yet they still struggle with complex mathematics. In this paper, we propose ToRA, a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems by seamlessly integrating natural language reasoning with the utilization of external tools (e.g., computation libraries and symbolic solvers), thereby amalgamating the analytical prowess of language and the computational efficiency of tools. To train ToRA, we curate interactive tool-use trajectories on mathematical datasets, apply imitation learning on the annotations, and propose output space shaping to further refine models' reasoning behavior. As a result, ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales with 13%-19% absolute improvements on average. Notably, ToRA-7B reaches 44.6% on the competition-level dataset MATH, surpassing the best open-source model WizardMath-70B by 22% absolute. ToRA-Code-34B is also the first open-source model that achieves an accuracy exceeding 50% on MATH, which significantly outperforms GPT-4's CoT result, and is competitive with GPT-4 solving problems with programs. Additionally, we conduct a comprehensive analysis of the benefits and remaining challenges of tool interaction for mathematical reasoning, providing valuable insights for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ToRA gets solid benchmark lifts from tool trajectories and imitation learning, but the gains could partly reflect memorized calling patterns rather than robust integration. The 7B model reaches 44.6% on MATH and the 34B code version clears 50%, beating prior open-source models including much larger ones, with average gains of 13-19% across ten datasets. That is the main result worth noting right away.

The approach curates interactive trajectories by prompting GPT-4 on training splits, filters them, then applies imitation learning plus output space shaping to encourage the model to interleave reasoning with tool calls like code execution or symbolic solvers. This pipeline is the concrete addition over earlier tool-augmented LLM work. The numbers are consistent and the comparison to GPT-4 CoT and program-based solving is useful.

The soft spot is exactly the one flagged in the stress test. Because trajectories are generated on the training splits of the same datasets whose test splits are later used for evaluation, the model may be learning specific syntactic patterns or recovery loops that happen to correlate with problem templates rather than general tool use. No ablation keeps the reasoning content but strips or alters the tool-call format, so it is hard to tell how much of the delta survives when the model has to discover tool usage on its own. The curation and filtering steps are also described at a high level, which makes reproducibility checks harder.

This paper is for groups working on hybrid LLM-plus-tool systems for quantitative reasoning in math, science, or education. Readers tracking MATH and GSM8K progress will find the numbers worth examining even if they want more controls. I would send it to peer review; the empirical improvements are large enough that referees should see the details and request the missing ablations.
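
The letter notes that the filtering step is described only at a high level. One plausible reading, offered here purely as an assumption, is an answer-match filter that keeps a generated trajectory only when its final answer agrees with the reference. The `Trajectory` class, `normalize`, and `keep_correct` below are hypothetical names, and the normalization is deliberately crude.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    problem_id: str
    text: str          # interleaved reasoning and tool calls
    final_answer: str  # answer extracted from the trajectory


def normalize(answer: str) -> str:
    """Crude normalization: drop spaces, surrounding dollar signs, trailing periods."""
    return answer.strip().strip("$").rstrip(".").replace(" ", "")


def keep_correct(trajectories, references):
    """Assumed filter: keep trajectories whose final answer matches the reference."""
    kept = []
    for traj in trajectories:
        gold = references.get(traj.problem_id)
        if gold is not None and normalize(traj.final_answer) == normalize(gold):
            kept.append(traj)
    return kept


# Made-up reference answers and candidate trajectories for illustration.
references = {"p1": "12", "p2": "3.5"}
candidates = [
    Trajectory("p1", "...reasoning and tool calls...", "$12$"),
    Trajectory("p1", "...reasoning and tool calls...", "13"),
    Trajectory("p2", "...reasoning and tool calls...", "3.5."),
    Trajectory("p3", "...reasoning and tool calls...", "7"),
]
kept = keep_correct(candidates, references)
print([(t.problem_id, t.final_answer) for t in kept])
```

A filter of this kind checks only the final answer, so it cannot tell sound tool use from a lucky or template-specific call pattern, which is the gap the letter points at.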

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ToRA, a series of Tool-integrated Reasoning Agents that combine natural language reasoning with external tools such as computation libraries and symbolic solvers to solve complex mathematical problems. Training involves curating interactive tool-use trajectories on datasets like GSM8K and MATH, followed by imitation learning and output space shaping. The models demonstrate substantial gains over open-source baselines on 10 mathematical reasoning datasets, including ToRA-7B achieving 44.6% accuracy on MATH (surpassing WizardMath-70B by 22%) and ToRA-Code-34B exceeding 50% on MATH, competitive with GPT-4.

Significance. If the results hold and the improvements stem from robust tool integration rather than memorization of trajectory patterns, this would be a significant contribution to mathematical reasoning in LLMs by showing how tool use can be effectively integrated via imitation learning. The work highlights the potential for open-source models to approach or exceed proprietary model performance on competition-level math problems and provides analysis of tool interaction benefits and challenges.

major comments (2)

[Section 3.2] The trajectory collection process using GPT-4 prompting on training splits, followed by filtering and output-space shaping, is described at a high level. Without an ablation that removes the specific tool-call format while preserving reasoning content, it remains unclear whether the reported gains (13-19% absolute on average across datasets, 22% on MATH over WizardMath-70B) reflect generalizable tool integration or exploitation of recurring syntactic patterns in the curated trajectories.
[Experimental results] Headline performance numbers (e.g., 44.6% and >50% on MATH) are reported without error bars, multiple random seeds, or full details on training hyperparameters and data splits, which limits assessment of the robustness and reproducibility of the central performance claims across the ten datasets.

minor comments (2)

[Abstract] The statement that ToRA-Code-34B is 'the first open-source model' to exceed 50% on MATH should explicitly list the prior open-source models considered in the comparison.
[Tables] Ensure all result tables include standard deviations or confidence intervals alongside accuracy metrics to support the cross-scale and cross-dataset claims.
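
As a rough illustration of the interval the referee asks for, the sketch below computes a Wilson score interval for a single accuracy figure. It assumes the standard MATH test split of 5,000 problems, which this review does not state, and it captures only sampling noise over test problems, not variance across training seeds. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# 44.6% accuracy on an assumed 5,000-problem test set is 2,230 correct answers.
low, high = wilson_interval(correct=2230, total=5000)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval is about 1.4 percentage points on either side of 44.6%, far narrower than the reported 22-point margin over WizardMath-70B; seed-to-seed variance would still need separate runs.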

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the contributions of tool integration and strengthens the reporting of our results. We address each major comment below and describe the revisions we will incorporate.

Referee: [Section 3.2] The trajectory collection process using GPT-4 prompting on training splits, followed by filtering and output-space shaping, is described at a high level. Without an ablation that removes the specific tool-call format while preserving reasoning content, it remains unclear whether the reported gains (13-19% absolute on average across datasets, 22% on MATH over WizardMath-70B) reflect generalizable tool integration or exploitation of recurring syntactic patterns in the curated trajectories.

Authors: We appreciate the referee's point on distinguishing tool integration from potential pattern memorization. Section 3.2 describes the GPT-4-based trajectory curation and output-space shaping, while Section 5 analyzes tool-use benefits through case studies and error breakdowns showing improved handling of computation and symbolic steps. To directly address the concern, we will add a new ablation in the revised manuscript: we will generate parallel trajectory sets that preserve reasoning content but modify or remove the specific tool-call syntax, then compare resulting model performance to isolate the contribution of the tool format.

revision: yes

Referee: [Experimental results] Headline performance numbers (e.g., 44.6% and >50% on MATH) are reported without error bars, multiple random seeds, or full details on training hyperparameters and data splits, which limits assessment of the robustness and reproducibility of the central performance claims across the ten datasets.

Authors: We agree that expanded experimental details improve reproducibility. In the revision we will add full specifications of training hyperparameters, data splits, and implementation choices. For variance, we will report any repeated-run statistics we can obtain and discuss consistency of gains across scales and datasets. However, running multiple independent random seeds for every model size and all ten datasets is computationally prohibitive given the resources needed to train up to 34B models; we will explicitly note this limitation.

revision: partial
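
The ablation promised in the first response, trajectories that keep the reasoning but drop the tool-call syntax, could be prepared along the following lines. This is a sketch under stated assumptions: the `<code>` and `<output>` delimiters are hypothetical stand-ins for whatever format the curated trajectories use, and `strip_tool_syntax` is an illustrative name.

```python
import re

# Hypothetical delimiters; the real trajectory format is not given in this review.
CODE = re.compile(r"<code>.*?</code>\s*", re.DOTALL)
OUTPUT = re.compile(r"<output>(.*?)</output>", re.DOTALL)


def strip_tool_syntax(trajectory: str, keep_results: bool = True) -> str:
    """Remove tool calls; optionally keep their results as plain text."""
    text = CODE.sub("", trajectory)
    if keep_results:
        text = OUTPUT.sub(lambda m: f"(computed value: {m.group(1).strip()})", text)
    else:
        text = OUTPUT.sub("", text)
    # Collapse blank lines left behind by the removed spans.
    return re.sub(r"\n{2,}", "\n", text).strip()


example = (
    "We need the sum of 1..100.\n"
    "<code>print(sum(range(1, 101)))</code>\n"
    "<output>5050</output>\n"
    "So the answer is 5050."
)
with_results = strip_tool_syntax(example)
without_results = strip_tool_syntax(example, keep_results=False)
print(with_results)
print(without_results)
```

Training one model on the original trajectories and another on the stripped variants would then separate the contribution of the tool format from that of the reasoning text, which is the comparison the referee asks for.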

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

The paper presents an empirical method: curating tool-use trajectories via GPT-4 on training splits of public datasets, applying imitation learning, and reporting accuracy on held-out test sets of MATH, GSM8K and eight other standard benchmarks. All performance numbers (e.g., ToRA-7B at 44.6% on MATH) are direct measurements against independent external baselines such as WizardMath-70B and GPT-4. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the central claim is therefore a set of falsifiable experimental outcomes rather than a derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claims rest on the assumption that high-quality tool-use trajectories can be curated and that imitation learning on them produces generalizable reasoning improvements; no new physical entities are postulated.

free parameters (1)

model scale choices (7B, 34B)
Selected to demonstrate scaling behavior and compare against existing open-source baselines of different sizes.

axioms (1)

domain assumption: Imitation learning on curated tool-use trajectories transfers to improved performance on unseen mathematical problems.
Invoked when claiming that training on the collected trajectories yields the reported gains across datasets.

pith-pipeline@v0.9.0 · 5783 in / 1348 out tokens · 31781 ms · 2026-05-19T09:14:36.772008+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fine-Tuning Small Reasoning Models for Quantum Field Theory
cs.LG 2026-04 unverdicted novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing
cs.CV 2025-06 unverdicted novelty 7.0

FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
cs.AI 2025-10 unverdicted novelty 6.0

TRACE is a reference-free multi-dimensional evaluation framework for tool-augmented LLM reasoning trajectories that uses an evidence bank and is validated on a new meta-evaluation dataset of flawed trajectories.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
cs.AI 2025-07 conditional novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

ToolRL: Reward is All Tool Learning Needs
cs.LG 2025-04 conditional novelty 6.0

A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
cs.LG 2024-06 conditional novelty 6.0

Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
cs.AI 2023-12 conditional novelty 6.0

Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

LLMs with in-context learning for Algorithmic Theoretical Physics
cs.LG 2026-05 unverdicted novelty 5.0

Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
cs.AI 2025-09 conditional novelty 5.0

UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI 2025-03 unverdicted novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
cs.SE 2024-01 unverdicted novelty 5.0

DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.

Rethinking Wireless Communications through Formal Mathematical AI Reasoning
eess.SP 2026-04 unverdicted novelty 4.0

Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.

Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation
cs.CL 2026-04 unverdicted novelty 4.0

AMR uses difficulty-aware routing and uncertainty-guided aggregation across three experts plus a neural verifier to reach 75.28% accuracy on GSM8K without synthetic training data.

Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub
cs.CL 2026-03 unverdicted novelty 4.0

Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
cs.CL 2024-01 unverdicted novelty 4.0

DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

A Survey on the Memory Mechanism of Large Language Model based Agents
cs.AI 2024-04 accept novelty 3.0

A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 18 Pith papers · 23 internal anchors

[1]

PaLM 2 Technical Report

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Improving language models by retrieving from trillions of tokens

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pp.\ 2206--2240. PMLR, 2022

work page 2022
[3]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

S \' e bastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco T \' u lio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4 . CoRR, abs/2303.12712, 2023. doi:10.48550/arXiv.2303.12712. U...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.12712 2023
[4]

Model compression

Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp.\ 535--541, 2006

work page 2006
[5]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Flash A ttention-2: Faster attention with better parallelism and work partitioning

Tri Dao. Flash A ttention-2: Faster attention with better parallelism and work partitioning. 2023

work page 2023
[8]

Computers and thought, volume 7

Edward A Feigenbaum, Julian Feldman, et al. Computers and thought, volume 7. New York McGraw-Hill, 1963

work page 1963
[9]

Specializing smaller language models towards multi-step reasoning

Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA , volume 202 of Pr...

work page 2023
[10]

PAL: Program-aided Language Models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

work page 2021
[13]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[14]

Large language models are reasoning teachers

Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 14852--14882, Toronto, Canada, July 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.acl-long.830. URL https://aclanthology.or...

work page doi:10.18653/v1/2023.acl-long.830 2023
[15]

Learning to solve arithmetic word problems with verb categorization

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 523--533, 2014

work page 2014
[16]

Large Language Models Can Self-Improve

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. CoRR, abs/2210.11610, 2022. doi:10.48550/arXiv.2210.11610. URL https://doi.org/10.48550/arXiv.2210.11610

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.11610 2022
[17]

Backward reasoning in large language models for verification

Weisen Jiang, Han Shi, Longhui Yu, Zhengying Liu, Yu Zhang, Zhenguo Li, and James T Kwok. Backward reasoning in large language models for verification. arXiv preprint arXiv:2308.07758, 2023

work page arXiv 2023
[18]

MAWPS : A math word problem repository

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS : A math word problem repository. In Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies , pp.\ 1152--1157, San Diego, California, June 2016. Association for Computational L...

work page doi:10.18653/v1/n16-1136 2016
[19]

Platypus: Quick, cheap, and powerful refinement of llms

Ariel N Lee, Cole J Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms. arXiv preprint arXiv:2308.07317, 2023

work page arXiv 2023
[20]

Making language models better reasoners with step-aware verifier

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 5315--5333, 2023

work page 2023
[21]

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning

Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=DHyHRBwJUTN

work page 2023
[24]

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Augmented Language Models: a Survey

Gr \'e goire Mialon, Roberto Dess \` , Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozi \`e re, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

A diverse corpus for evaluating and developing E nglish math word problem solvers

Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing E nglish math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 975--984, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.92. URL https://aclantholog...

work page doi:10.18653/v1/2020.acl-main.92 2020
[27]

Lila: A unified benchmark for mathematical reasoning

Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. Lila: A unified benchmark for mathematical reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

work page 2022
[28]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[29]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

work page 2023
[30]

ART: Automatic multi-step reasoning and tool-use for large language models

Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Talm: Tool augmented language models

Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022

work page arXiv 2022
[32]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 2080--2094, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2...

work page internal anchor Pith review doi:10.18653/v1/2021.naacl-main.168 2021
[33]

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The R efined W eb dataset for F alcon LLM : outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023. URL https://arxiv.org/abs/2306.01116

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Instruction Tuning with GPT-4

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Generative language modeling for automated theorem proving

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020

work page arXiv 2009
[36]

Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning

Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp.\ 1--14, 2021

work page 2021
[37]

Code Llama: Open Foundation Models for Code

Baptiste Rozi \`e re, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, J \'e r \'e my Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

[38]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

[39]

Chaining simultaneous thoughts for numerical reasoning

Zhihong Shao, Fei Huang, and Minlie Huang. Chaining simultaneous thoughts for numerical reasoning. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp. 2533--2547. Association for Computational Linguistics, 2022. doi:10....

doi:10.18653/v1/2022.findings-emnlp.187
[40]

Synthetic prompting: Generating chain-of-thought demonstrations for large language models

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023...

[41]

Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. CoRR, abs/2305.15294, 2023b. doi:10.48550/arXiv.2305.15294. URL https://doi.org/10.48550/arXiv.2305.15294

[42]

Stanford Alpaca: An Instruction-Following LLaMA Model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

[43]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a

[44]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Har...

doi:10.48550/arxiv.2307.09288
[45]

Chain of thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=...

[46]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X

[47]

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023

[48]

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023

[49]

Star: Bootstrapping reasoning with reasoning

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476--15488, 2022

[50]

Evaluating and improving tool-augmented computation-intensive math reasoning

Beichen Zhang, Kun Zhou, Xilin Wei, Wayne Xin Zhao, Jing Sha, Shijin Wang, and Ji-Rong Wen. Evaluating and improving tool-augmented computation-intensive math reasoning. arXiv preprint arXiv:2306.02408, 2023

[51]

Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification

Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023a

[52]

Least-to-most prompting enables complex reasoning in large language models

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview....

[53]

Solving math word problems via cooperative reasoning induced language models

Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Yongfeng Huang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. Solving math word problems via cooperative reasoning induced language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4471--4485, Toronto, Canada, July 2023. Association...

doi:10.18653/v1/2023.acl-long.245