MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
Pith reviewed 2026-05-17 23:42 UTC · model grok-4.3
The pith
Training on a hybrid of chain-of-thought and program-of-thought rationales builds open-source math models that outperform prior leaders on nine benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that instruction tuning on MathInstruct, which mixes chain-of-thought and program-of-thought rationales across thirteen datasets with wide math-field coverage, yields MAmmoTH models that substantially outperform existing open-source models on nine mathematical reasoning datasets at every scale, delivering average accuracy gains of 16 to 32 percent. The 7B version scores 33 percent on MATH, 23 points above the previous best open-source 7B model, while the 34B version scores 44 percent on MATH and surpasses GPT-4's CoT result.
What carries the argument
MathInstruct, the instruction-tuning dataset that presents a hybrid of chain-of-thought and program-of-thought rationales compiled from thirteen math datasets.
If this is right
- Models gain the ability to apply either verbal steps or code execution depending on the math problem.
- The program-of-thought component increases the potential for tool use during reasoning.
- Open-source models reach higher accuracy on competition-level tasks such as MATH.
- Broad coverage across math fields supports stronger generalization to new problems.
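To make the two rationale styles concrete, here is a minimal Python sketch. A chain-of-thought rationale is plain text, while a program-of-thought rationale is a small program whose printed output is the answer. The helper names (`run_pot`, `answer`), the toy problem, and the fall-back-to-CoT rule are illustrative assumptions, not a procedure confirmed by the text above.

```python
import contextlib
import io
from typing import Optional


def run_pot(program: str) -> Optional[str]:
    """Execute a program-of-thought rationale and return what it prints.

    Returns None if the program raises or prints nothing.
    NOTE: exec() on model output is unsafe outside a sandbox; this is
    for illustration only.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(program, {})  # fresh namespace for each program
    except Exception:
        return None
    output = buffer.getvalue().strip()
    return output or None


def answer(pot_program: str, cot_answer: str) -> str:
    """Prefer the executed PoT result; fall back to the CoT answer (assumed rule)."""
    result = run_pot(pot_program)
    return result if result is not None else cot_answer


# Hypothetical problem: "A train travels at 60 km/h for 2.5 hours. How far does it go?"
cot_rationale = "Distance = speed * time = 60 * 2.5 = 150. The answer is 150."
pot_rationale = "speed = 60\ntime = 2.5\nprint(speed * time)"

print(cot_rationale)                              # verbal steps, answer stated in text
print(answer(pot_rationale, cot_answer="150"))    # prints 150.0 (program executed)
print(answer("print(1 / 0)", cot_answer="150"))   # prints 150 (program failed, CoT used)
```

The PoT path hands the arithmetic to the interpreter, which is where the tool-use benefit would come from; the CoT path remains available for problems that do not reduce to a short computation.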
Where Pith is reading between the lines
- The same hybrid rationale mix could be applied to scientific reasoning tasks that also mix explanation and simulation.
- Curating high-quality rationales may matter more than raw data volume when specializing models for reasoning.
- Adding verification steps to the program-of-thought outputs could further reduce calculation errors.
- Smaller models trained this way might serve educational tools that need both text explanations and runnable code.
Load-bearing premise
The measured accuracy gains result specifically from the hybrid CoT-PoT format and the newly curated rationales rather than from dataset size, model scale, or other training choices.
What would settle it
Train identical base models on matched volumes of data that contain only CoT rationales, only PoT rationales, or the original uncurated sources, then check whether the reported gains on the nine evaluation datasets disappear.
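A minimal sketch of how the matched-volume training arms for such a test could be assembled. The `style` field, the 50/50 hybrid split, and matching by example count rather than token count are all assumptions made for illustration; none of them comes from the paper.

```python
import random
from collections import defaultdict


def matched_subsets(examples, budget, seed=0):
    """Build CoT-only, PoT-only, and hybrid training sets of equal size.

    `examples` is a list of dicts with a hypothetical "style" key
    ("cot" or "pot"). Size is matched by example count; matching by
    token count would need a tokenizer and is left out here.
    """
    rng = random.Random(seed)
    by_style = defaultdict(list)
    for ex in examples:
        by_style[ex["style"]].append(ex)

    if min(len(by_style["cot"]), len(by_style["pot"])) < budget:
        raise ValueError("budget exceeds the smaller style pool")

    cot_only = rng.sample(by_style["cot"], budget)
    pot_only = rng.sample(by_style["pot"], budget)
    half = budget // 2  # assumed 50/50 mix for the hybrid arm
    hybrid = rng.sample(by_style["cot"], half) + rng.sample(by_style["pot"], budget - half)
    rng.shuffle(hybrid)
    return {"cot_only": cot_only, "pot_only": pot_only, "hybrid": hybrid}


# Toy pool standing in for real data.
pool = [{"id": i, "style": "cot" if i % 2 else "pot"} for i in range(200)]
arms = matched_subsets(pool, budget=50)
print({name: len(rows) for name, rows in arms.items()})
```

Each arm would then be used to fine-tune the same base model under the same schedule, so that any difference on the evaluation sets can be traced to the rationale format.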
Original abstract
We introduce MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset. MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and also ensures extensive coverage of diverse fields in math. The hybrid of CoT and PoT not only unleashes the potential of tool use but also allows different thought processes for different math problems. As a result, the MAmmoTH series substantially outperform existing open-source models on nine mathematical reasoning datasets across all scales with an average accuracy gain between 16% and 32%. Remarkably, our MAmmoTH-7B model reaches 33% on MATH (a competition-level dataset), which exceeds the best open-source 7B model (WizardMath) by 23%, and the MAmmoTH-34B model achieves 44% accuracy on MATH, even surpassing GPT-4's CoT result. Our work underscores the importance of diverse problem coverage and the use of hybrid rationales in developing superior math generalist models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the MAmmoTH series of open-source LLMs for general mathematical problem-solving. These models are fine-tuned on MathInstruct, a hybrid instruction dataset compiled from 13 sources (six with newly curated rationales) that mixes chain-of-thought (CoT) and program-of-thought (PoT) formats. The central claim is that this hybrid approach yields substantial gains over prior open-source models on nine math reasoning benchmarks, with average accuracy improvements of 16-32%, including 33% on MATH for the 7B variant (23 points above WizardMath) and 44% for the 34B variant (exceeding GPT-4 CoT).
Significance. If the gains are robustly attributable to the hybrid CoT-PoT format and curated rationales rather than scale or unstated factors, the work would provide a practical recipe for improving mathematical reasoning in open models and underscore the value of diverse rationale styles. The release of models and dataset supports reproducibility and follow-up research.
major comments (1)
- [§4 and Table 2] §4 (Experiments) and Table 2: End-to-end results are reported against WizardMath and other baselines, but no ablation holds base model, training schedule, and total token count fixed while varying only the presence of PoT examples versus pure CoT or the six newly curated rationales. Without this isolation, the 16-32% average gains and the specific MATH jumps cannot be confidently attributed to the hybrid format as claimed in the abstract and §3.
minor comments (2)
- [§3.2] §3.2: The description of how the six new rationales were curated could be expanded with explicit quality-control steps or inter-annotator agreement metrics to strengthen the claim of 'meticulously curated'.
- [Figure 1 and §3] Figure 1 and §3: The mixture proportions across the 13 sources are not tabulated; adding a breakdown of example counts or token shares per source would clarify the 'extensive coverage' assertion.
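The breakdown requested in the second minor comment could be produced with a few lines of Python. This is a sketch under assumptions: the `source` and `style` fields and the toy records are hypothetical, and the counts are placeholders, not figures from the paper.

```python
from collections import Counter


def mixture_table(examples):
    """Return (source, style, count, share) rows sorted by descending count."""
    counts = Counter((ex["source"], ex["style"]) for ex in examples)
    total = sum(counts.values())
    return [
        (source, style, n, n / total)
        for (source, style), n in counts.most_common()
    ]


# Toy records; source names and counts are placeholders only.
toy = (
    [{"source": "source_a", "style": "cot"}] * 6
    + [{"source": "source_a", "style": "pot"}] * 3
    + [{"source": "source_b", "style": "pot"}] * 1
)
for source, style, n, share in mixture_table(toy):
    print(f"{source:<10} {style:<4} {n:>3} {share:6.1%}")
```

Token shares could be reported the same way by summing a per-example token count instead of counting rows.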
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the paper.
- Referee: [§4 and Table 2] §4 (Experiments) and Table 2: End-to-end results are reported against WizardMath and other baselines, but no ablation holds base model, training schedule, and total token count fixed while varying only the presence of PoT examples versus pure CoT or the six newly curated rationales. Without this isolation, the 16-32% average gains and the specific MATH jumps cannot be confidently attributed to the hybrid format as claimed in the abstract and §3.
  Authors: We appreciate the referee's emphasis on isolating the contribution of the hybrid CoT-PoT format and the newly curated rationales. Our primary comparisons are to WizardMath and similar baselines that build on the same Llama-2 family of base models (our 34B model builds on Code Llama) and use comparable fine-tuning setups, with the key distinction being our use of MathInstruct's hybrid rationales versus their predominantly CoT-based data. However, we acknowledge that a more tightly controlled ablation, one that fixes base model, training schedule, and total token count while varying only PoT inclusion or the six curated sources, would provide stronger attribution. In the revised manuscript, we will add such an ablation study in §4, reporting results for pure-CoT, pure-PoT, and hybrid variants under matched conditions. This will directly test the claims in the abstract and §3 regarding the benefits of hybrid rationales.
  Revision: yes
Circularity Check
No significant circularity; results rest on external benchmarks
The paper trains models on the newly compiled MathInstruct mixture and reports accuracy on nine standard held-out mathematical reasoning benchmarks. These evaluation sets are distinct from the training sources, and the reported gains are measured against external baselines rather than being derived from any fitted parameter or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked that reduce the central performance claims to the inputs by construction. The absence of ablations is a limitation on causal attribution but does not constitute circularity under the defined criteria.
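The claim that the evaluation sets are distinct from the training sources can be spot-checked mechanically. The sketch below flags evaluation questions whose normalized text also appears in the training data. The normalization rule and the toy questions are assumptions, and exact matching after normalization is only a coarse first pass; it will miss paraphrases.

```python
import re


def normalize(question: str) -> str:
    """Lowercase and collapse punctuation and whitespace for exact-match comparison."""
    return re.sub(r"[^a-z0-9]+", " ", question.lower()).strip()


def overlap(train_questions, eval_questions):
    """Return evaluation questions whose normalized text also appears in training."""
    seen = {normalize(q) for q in train_questions}
    return [q for q in eval_questions if normalize(q) in seen]


# Toy questions for illustration only.
train = ["What is 2 + 2?", "Solve x^2 = 9."]
held_out = ["what is 2+2 ?", "Find the derivative of x^3."]
print(overlap(train, held_out))
```

Because this normalization strips operators, "2 + 2" and "2 - 2" collide, so any hit should be reviewed by hand before it is counted as contamination.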
Axiom & Free-Parameter Ledger
free parameters (1)
- LLM training hyperparameters
axioms (1)
- domain assumption: Instruction tuning on curated datasets with rationales improves LLM reasoning performance
Forward citations
Cited by 20 Pith papers
- AI co-mathematician: Accelerating mathematicians with agentic AI
  An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
- Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
  Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.
- EDUMATH: Generating Standards-aligned Educational Math Word Problems
  EDUMATH introduces the first teacher-annotated dataset for standards-aligned math word problem generation and demonstrates that it enables smaller open LLMs to match larger models while producing problems students pre...
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
  Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, Ar...
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
  MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
- AI co-mathematician: Accelerating mathematicians with agentic AI
  An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.
- CeRA: Overcoming the Linear Ceiling of Low-Rank Adaptation via Capacity Expansion
  CeRA overcomes LoRA's linear ceiling by injecting non-linear SiLU gating and dropout, outperforming high-rank LoRA on complex math reasoning with 1/8 the parameters.
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model
  VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
- SmolVLM: Redefining small and efficient multimodal models
  SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
- Muon is Scalable for LLM Training
  Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
- Process Reinforcement through Implicit Rewards
  PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
  Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
- Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity
  Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.
- NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning
  Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
- Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
  Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
- NVIDIA Nemotron 3: Efficient and Open Intelligence
  NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
  SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
  DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
- A Survey on Knowledge Distillation of Large Language Models
  A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
Reference graph
Works this paper leans on
-
[1]
M ath QA : Towards interpretable math word problem solving with operation-based formalisms
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. M ath QA : Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long...
-
[2]
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. ArXiv preprint, abs/2305.10403, 2023. URL https://arxiv.org/abs/2305.10403
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. ArXiv preprint, abs/2212.08073, 2022. URL https://arxiv.org/abs/2212.08073
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. ArXiv preprint, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[5]
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. ArXiv preprint, abs/2211.12588, 2022. URL https://arxiv.org/abs/2211.12588
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[6]
Theoremqa: A theorem-driven question answering dataset
Wenhu Chen, Ming Yin, Max Ku, Elaine Wan, Xueguang Ma, Jianyu Xu, Tony Xia, Xinyi Wang, and Pan Lu. Theoremqa: A theorem-driven question answering dataset. ArXiv preprint, abs/2305.12524, 2023. URL https://arxiv.org/abs/2305.12524
-
[7]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. ArXiv preprint, abs/2210.11416, 2022. URL https://arxiv.org/abs/2210.11416
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[8]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. ArXiv preprint, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[9]
Advancing mathematics by guiding human intuition with ai
Alex Davies, Petar Veli c kovi \'c , Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Toma s ev, Richard Tanburn, Peter Battaglia, Charles Blundell, Andr \'a s Juh \'a sz, et al. Advancing mathematics by guiding human intuition with ai. Nature, 600 0 (7887): 0 70--74, 2021. URL https://www.nature.com/articles/s41586-021-04086-x
work page 2021
-
[10]
QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. ArXiv preprint, abs/2305.14314, 2023. URL https://arxiv.org/abs/2305.14314
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Andrew Drozdov, Nathanael Sch \"a rli, Ekin Aky \"u rek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, and Denny Zhou. Compositional semantic parsing with large language models. International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=gJW8hSGBys8
work page 2023
-
[12]
Pal: Program-aided language models
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp.\ 10764--10799. PMLR, 2023. URL https://proceedings.mlr.press/v202/gao23f/gao23f.pdf
work page 2023
-
[13]
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. ArXiv preprint, abs/2305.11738, 2023. URL https://arxiv.org/abs/2305.11738
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 , 2021 a . URL https://openreview.net/forum?id=d7KBjmI3GmQ
work page 2021
-
[15]
Measuring mathematical problem solving with the math dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021 b . URL https://datasets-benchmarks-proceedings.neurips.cc/paper...
work page 2021
-
[16]
Learning to solve arithmetic word problems with verb categorization
Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ( EMNLP ) , pp.\ 523--533, 2014. doi:10.3115/v1/D14-1058. URL https://aclanthology.org/D14-1058
-
[17]
Large language models are zero-shot reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 2022
work page 2022
-
[18]
Parsing algebraic word problems into equations
Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3: 0 585--597, 2015. doi:10.1162/tacl_a_00160. URL https://aclanthology.org/Q15-1042
-
[19]
MAWPS : A math word problem repository
Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS : A math word problem repository. In Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies , pp.\ 1152--1157, 2016. doi:10.18653/v1/N16-1136. URL https://aclanthology.org/N16-1136
-
[20]
Platypus: Quick, cheap, and powerful refinement of llms
Ariel N Lee, Cole J Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms. ArXiv preprint, abs/2308.07317, 2023. URL https://arxiv.org/abs/2308.07317
-
[21]
Solving quantitative reasoning problems with language models
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35: 0 3843--3857, 2022. URL https://openreview.net/pdf?id=IFXTZERXdM7
work page 2022
-
[22]
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large scale language model society. ArXiv preprint, abs/2303.17760, 2023 a . URL https://arxiv.org/abs/2303.17760
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Making language models better reasoners with step-aware verifier
Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 5315--5333, 2023 b . URL https://aclanthology.org/2023.acl-long.291.pdf
work page 2023
-
[24]
Program induction by rationale generation: Learning to solve and explain algebraic word problems
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 158--167, 2017. doi:10.18653/v1/P17-1015. URL https://aclanthology.org/P17-1015
-
[25]
The flan collection: Designing data and methods for effective instruction tuning
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. ICML, 2023. URL https://openreview.net/pdf?id=ZX4uS605XV
work page 2023
-
[26]
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. ArXiv preprint, abs/2308.09583, 2023. URL https://arxiv.org/abs/2308.09583
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Language models of code are few-shot commonsense learners
Aman Madaan, Shuyan Zhou, Uri Alon, Yiming Yang, and Graham Neubig. Language models of code are few-shot commonsense learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 1384--1403, 2022. URL https://aclanthology.org/2022.emnlp-main.90.pdf
work page 2022
-
[28]
LILA : A unified benchmark for mathematical reasoning
Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. LILA : A unified benchmark for mathematical reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 5807--5832, 2022 a . URL https://acl...
work page 2022
-
[29]
N um GLUE : A suite of fundamental yet challenging mathematical reasoning tasks
Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, and Ashwin Kalyan. N um GLUE : A suite of fundamental yet challenging mathematical reasoning tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 3505--3523, 2022 b . doi:10.18653/v1/2022....
-
[30]
Orca: Progressive Learning from Complex Explanation Traces of GPT-4
Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4. ArXiv preprint, abs/2306.02707, 2023. URL https://arxiv.org/abs/2306.02707
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
Codegen: An open large language model for code with multi-turn program synthesis
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/pdf?id=iaYcJKpY2B_
work page 2023
-
[32]
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. In Deep Learning for Code Workshop, 2022. URL https://arxiv.org/abs/2112.00114
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[33]
OpenAI. Gpt-4 technical report. ArXiv preprint, abs/2303.08774, 2023. URL https://arxiv.org/abs/2303.08774
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 2080--2094, 2021. doi:10.18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.na...
work page internal anchor Pith review doi:10.18653/v1/2021.naacl-main.168 2021
-
[35]
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. ArXiv preprint, abs/2306.01116, 2023. URL https://arxiv.org/abs/2306.01116
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. ArXiv preprint, abs/2304.03277, 2023. URL https://arxiv.org/abs/2304.03277
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
Zero: Memory optimizations toward training trillion parameter models
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.\ 1--16. IEEE, 2020. URL https://dl.acm.org/doi/10.5555/3433701.3433727
-
[38]
Solving general arithmetic word problems
Subhro Roy and Dan Roth. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.\ 1743--1752, 2015. doi:10.18653/v1/D15-1202. URL https://aclanthology.org/D15-1202
-
[39]
Code Llama: Open Foundation Models for Code
Baptiste Rozi \`e re, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, J \'e r \'e my Rapin, et al. Code llama: Open foundation models for code. ArXiv preprint, abs/2308.12950, 2023. URL https://arxiv.org/abs/2308.12950
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian - Jian Jiang, Han Wang, Matteo Manica,...
work page 2022
-
[41]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Mirac Suzgun, Nathan Scales, Nathanael Sch \"a rli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. ArXiv preprint, abs/2210.09261, 2022. URL https://arxiv.org/abs/2210.09261
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [42]
-
[43]
Galactica: A Large Language Model for Science
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. ArXiv preprint, abs/2211.09085, 2022. URL https://arxiv.org/abs/2211.09085
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[44]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth \'e e Lacroix, Baptiste Rozi \`e re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971, 2023 a . URL https://arxiv.org/abs/2302.13971
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023 b . URL https://arxiv.org/abs/2307.09288
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
Iteratively prompt pre-trained language models for chain of thought
Boshi Wang, Xiang Deng, and Huan Sun. Iteratively prompt pre-trained language models for chain of thought. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 2714--2730. Association for Computational Linguistics, 2022 a . URL https://aclanthology.org/2022.emnlp-main.174
work page 2022
-
[47]
Towards understanding chain-of-thought prompting: An empirical study of what matters
Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 2717--2739. Association for Computational Linguistics, 2023 a...
-
[48]
Boshi Wang, Xiang Yue, and Huan Sun. Can chatgpt defend the truth? automatic dialectical evaluation elicits llms' deficiencies in reasoning. ArXiv preprint, abs/2305.13160, 2023 b . URL https://arxiv.org/abs/2305.13160
-
[49]
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. ArXiv preprint, abs/2305.04091, 2023 c . URL https://arxiv.org/abs/2305.04091
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
Making large language models better reasoners with alignment
Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, and Zhifang Sui. Making large language models better reasoners with alignment. ArXiv preprint, abs/2309.02144, 2023 d . URL https://arxiv.org/abs/2309.02144
-
[51]
SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. ArXiv preprint, abs/2307.10635, 2023e. URL https://arxiv.org/abs/2307.10635
-
[52]
Self-consistency improves chain of thought reasoning in language models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. International Conference on Learning Representations (ICLR), 2023f. URL https://openreview.net/pdf?id=1PL1NIMMrw
-
[53]
Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parma...
-
[54]
How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? Exploring the state of instruction tuning on open resources. ArXiv preprint, abs/2306.04751, 2023g. URL https://arxiv.org/abs/2306.04751
-
[55]
Self-instruct: Aligning language model with self generated instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2023h. URL https://aclanthology.org/2023.acl-long.754.pdf
-
[56]
CodeT5+: Open code large language models for code understanding and generation
Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. CodeT5+: Open code large language models for code understanding and generation. ArXiv preprint, abs/2305.07922, 2023i. URL https://arxiv.org/abs/2305.07922
-
[57]
Finetuned language models are zero-shot learners
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR
-
[58]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022b. URL https://openreview.net/pdf?id=_VjQlMeSB_J
-
[59]
Simple synthetic data reduces sycophancy in large language models
Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V Le. Simple synthetic data reduces sycophancy in large language models. ArXiv preprint, abs/2308.03958, 2023. URL https://arxiv.org/abs/2308.03958
-
[60]
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv preprint, abs/1910.03771, 2019. URL https://arxiv.org/abs/1910.03771
-
[61]
An explanation of in-context learning as implicit Bayesian inference
Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022. URL https://openreview.net/forum?id=RdJVFCHjUMI
-
[62]
Decomposition enhances reasoning via self-evaluation guided decoding
Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, and Qizhe Xie. Decomposition enhances reasoning via self-evaluation guided decoding. ArXiv preprint, abs/2305.00633, 2023. URL https://arxiv.org/abs/2305.00633
-
[63]
WizardLM: Empowering large pre-trained language models to follow complex instructions
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. WizardLM: Empowering large language models to follow complex instructions. ArXiv preprint, abs/2304.12244, 2023. URL https://arxiv.org/abs/2304.12244
-
[64]
GPT can solve mathematical problems without a calculator
Zhen Yang, Ming Ding, Qingsong Lv, Zhihuan Jiang, Zehai He, Yuyi Guo, Jinfeng Bai, and Jie Tang. GPT can solve mathematical problems without a calculator. ArXiv preprint, abs/2309.03241, 2023. URL https://arxiv.org/abs/2309.03241
-
[65]
ReAct: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/pdf?id=WE_vluYUL-X
-
[66]
CrossFit: A few-shot learning challenge for cross-task generalization in NLP
Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7163–7189, 2021. doi:10.18653/v1/2021.emnlp-main.572. URL https://aclanthology.org/2021.emnlp-main.572
-
[67]
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap your own mathematical questions for large language models. ArXiv preprint, abs/2309.12284, 2023. URL https://arxiv.org/abs/2309.12284
-
[68]
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on learning mathematical reasoning with large language models. ArXiv preprint, abs/2308.01825, 2023. URL https://arxiv.org/abs/2308.01825
-
[69]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. ArXiv preprint, abs/2205.01068, 2022. URL https://arxiv.org/abs/2205.01068
-
[70]
Progressive-hint prompting improves reasoning in large language models
Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models. ArXiv preprint, abs/2304.09797, 2023a. URL https://arxiv.org/abs/2304.09797
-
[71]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. ArXiv preprint, abs/2306.05685, 2023b. URL https://arxiv.org/abs/2306.05685
-
[72]
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. ArXiv preprint, abs/2304.06364, 2023. URL https://arxiv.org/abs/2304.06364
-
[73]
Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using GPT-4 code interpreter with code-based self-verification. ArXiv preprint, abs/2308.07921, 2023a. URL https://arxiv.org/abs/2308.07921
-
[74]
LIMA: Less Is More for Alignment
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. LIMA: Less is more for alignment. ArXiv preprint, abs/2305.11206, 2023b. URL https://arxiv.org/abs/2305.11206
-
[75]
Least-to-most prompting enables complex reasoning in large language models
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. International Conference on Learning Representations (ICLR), 2023c. URL https://openreview.net/pdf?id=WZH7099tgfM