pith. machine review for the scientific record

arXiv: 2309.12284 · v4 · submitted 2023-09-21 · 💻 cs.CL · cs.AI

Recognition: no theorem link

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-13 10:00 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords mathematical reasoning · large language models · fine-tuning · data augmentation · GSM8K · MATH · LLaMA-2 · question rewriting

The pith

Rewriting existing math questions from multiple perspectives lets fine-tuned LLaMA-2 models reach 66.4 percent on GSM8K.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that taking standard math problems and rewriting each one several times from fresh angles creates a more effective training set, called MetaMathQA. Fine-tuning LLaMA-2 models on this set produces large accuracy jumps on GSM8K and MATH without adding any new external facts or problems. The 7B version hits 66.4 percent on GSM8K and 19.4 percent on MATH, beating earlier open-source models of the same size by 11.5 and 8.7 percentage points. The 70B version, at 82.3 percent, slightly exceeds GPT-3.5-Turbo on GSM8K. The approach treats the bottleneck in mathematical reasoning as insufficient variety in how problems are presented rather than insufficient raw data volume.

Core claim

By rewriting each original mathematical question from multiple distinct perspectives without introducing external knowledge, the authors create the MetaMathQA dataset that, when used to fine-tune LLaMA-2, produces models with substantially stronger mathematical reasoning capabilities, as measured by accuracy on GSM8K and MATH benchmarks.

What carries the argument

The bootstrapping process of rewriting each question from multiple perspectives to generate diverse training examples in MetaMathQA.
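
To make that process concrete, here is a minimal sketch of multi-perspective rewriting in Python. The perspective names and prompt templates are hypothetical placeholders rather than the paper's actual prompts (which, as the referee notes below, are not specified), and `generate` stands in for any text-generation call.

```python
# Hypothetical perspective templates -- illustrative only, not the paper's prompts.
PERSPECTIVES = {
    "rephrase": (
        "Rewrite the following math problem in different words, keeping the "
        "numbers and the final answer unchanged:\n{question}"
    ),
    "backward": (
        "Rewrite the following math problem so that one of the given "
        "quantities becomes the unknown and the original answer appears in "
        "the problem statement:\n{question}"
    ),
    "verify": (
        "Turn the following problem and a candidate answer into a question "
        "asking whether that answer is correct:\n{question}"
    ),
}


def bootstrap_question(question: str, generate) -> list[str]:
    """Produce one rewrite of `question` per perspective.

    `generate` is an assumed callable that sends a prompt to an LLM and
    returns the generated text; no external knowledge enters the prompt.
    """
    return [generate(tpl.format(question=question)) for tpl in PERSPECTIVES.values()]
```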

Load-bearing premise

Rewriting questions from multiple perspectives produces sufficiently diverse, high-quality, and non-redundant examples that improve actual reasoning rather than merely increasing data volume.

What would settle it

Train two models on identical numbers of examples, one using the perspective-rewriting process and one using simple duplication or random rephrasing, then compare their GSM8K and MATH scores.
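
A minimal sketch of that control, assuming a `rewriter` callable that returns perspective rewrites; fine-tuning and benchmark evaluation are treated as black boxes, and only the construction of the two size-matched training sets is shown.

```python
import random


def build_matched_training_sets(seed_questions, rewriter, n_copies=4, seed=0):
    """Build two training sets with identical example counts: one from
    perspective rewriting, one from plain duplication of the seed questions.

    `rewriter(q, k)` is an assumed callable returning k distinct rewrites of q.
    Fine-tune one model on each set, then compare GSM8K and MATH accuracy.
    """
    rng = random.Random(seed)
    rewritten, duplicated = [], []
    for q in seed_questions:
        rewritten.extend(rewriter(q, n_copies))   # treatment arm
        duplicated.extend([q] * n_copies)         # volume-matched control arm
    assert len(rewritten) == len(duplicated)      # data volume held fixed
    rng.shuffle(rewritten)
    rng.shuffle(duplicated)
    return rewritten, duplicated
```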

read the original abstract

Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (e.g., LLaMA-2) are still far away from satisfactory for solving mathematical problem due to the complex reasoning procedures. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Particularly, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release all the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MetaMath, a fine-tuning approach for LLaMA-2 models that first bootstraps a new dataset (MetaMathQA) by rewriting existing mathematical questions from multiple perspectives without introducing external knowledge, then trains on this augmented data. It reports large gains on GSM8K (66.4% for 7B, 82.3% for 70B) and MATH (19.4% for 7B), exceeding prior open-source models of comparable size by 11.5 percentage points on GSM8K and 8.7 on MATH at the 7B scale.

Significance. If the performance lift is shown to arise from the diversity and quality of the multi-perspective rewrites rather than from simply increasing training volume, the method supplies a low-cost, knowledge-free data-augmentation recipe that could be applied to other reasoning domains and would materially narrow the gap between open-source and closed-source mathematical reasoning models.

major comments (3)
  1. [Section 3] Section 3 (MetaMathQA Construction): the rewriting procedure is described only at a high level; the paper does not specify the exact prompts, the number of rewrites per seed question, or any automated filters for mathematical validity or non-redundancy, preventing independent reproduction of the claimed data quality.
  2. [Section 4.2] Section 4.2 and Table 2: no ablation holds total training tokens or example count fixed while varying the rewrite strategy (e.g., MetaMathQA vs. duplicated original GSM8K/MATH vs. random paraphrases). Without this control, the 11.5% GSM8K and 8.7% MATH gains cannot be attributed to multi-perspective rewriting rather than increased data volume.
  3. [Section 4.3] Section 4.3: the comparison tables report single-run accuracies without error bars, multiple random seeds, or statistical significance tests, which is especially problematic when claiming large margins over prior SOTA models of the same size.
minor comments (2)
  1. [Figure 1] Figure 1 caption and surrounding text use inconsistent terminology ('forward' vs. 'backward' rewriting) that is never formally defined.
  2. [Abstract] The abstract states that MetaMath-70B is 'slightly better than GPT-3.5-Turbo' on GSM8K, but the main text does not report the exact GPT-3.5-Turbo score used for this comparison.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of reproducibility, experimental controls, and statistical reporting that will strengthen the manuscript. We address each major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (MetaMathQA Construction): the rewriting procedure is described only at a high level; the paper does not specify the exact prompts, the number of rewrites per seed question, or any automated filters for mathematical validity or non-redundancy, preventing independent reproduction of the claimed data quality.

    Authors: We agree that additional details are required for full reproducibility. In the revised manuscript we will add the exact prompts used for multi-perspective rewriting in an appendix. We generate four rewrites per seed question. For quality control we apply an automated filter that solves both the original and rewritten questions with a symbolic solver and discards any pair whose answers differ; we also remove near-duplicates via embedding similarity. These steps will be described in detail. revision: yes

  2. Referee: [Section 4.2] Section 4.2 and Table 2: no ablation holds total training tokens or example count fixed while varying the rewrite strategy (e.g., MetaMathQA vs. duplicated original GSM8K/MATH vs. random paraphrases). Without this control, the 11.5% GSM8K and 8.7% MATH gains cannot be attributed to multi-perspective rewriting rather than increased data volume.

    Authors: We acknowledge that a volume-controlled ablation is necessary to isolate the contribution of multi-perspective rewriting. We will add this experiment in the revised version: we train on (i) the original GSM8K/MATH data duplicated to match the example count of MetaMathQA, (ii) random paraphrases generated with the same model and prompt style but without the multi-perspective instruction, and (iii) MetaMathQA itself, keeping total training tokens fixed. Results will be reported in an updated Table 2. revision: yes

  3. Referee: [Section 4.3] Section 4.3: the comparison tables report single-run accuracies without error bars, multiple random seeds, or statistical significance tests, which is especially problematic when claiming large margins over prior SOTA models of the same size.

    Authors: We agree that reporting variance would increase confidence in the results. Due to compute limits we performed single runs for the 70B model; for the 7B model we will rerun with three random seeds and report mean and standard deviation. We will also add a brief discussion noting that the observed margins (11.5% and 8.7%) substantially exceed typical run-to-run variance observed in similar fine-tuning settings. These changes will appear in Section 4.3 and the tables. revision: partial
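
The quality-control step described in response 1 (keep a rewrite only if it resolves to the same answer as its seed, then drop near-duplicates by embedding similarity) could look roughly like the sketch below. The solver output and embedding model are stand-ins for whatever the authors actually use, and the 0.95 threshold is an arbitrary illustrative value.

```python
import numpy as np


def answers_match(seed_answer: str, rewrite_answer: str, tol: float = 1e-6) -> bool:
    """Discard a rewrite whose solved answer disagrees with its seed's answer."""
    try:
        return abs(float(seed_answer) - float(rewrite_answer)) <= tol
    except ValueError:
        return seed_answer.strip() == rewrite_answer.strip()


def drop_near_duplicates(questions, embed, threshold=0.95):
    """Greedy near-duplicate removal by cosine similarity of embeddings.

    `embed` is an assumed callable mapping a list of strings to an (n, d)
    array; any off-the-shelf sentence-embedding model would fit here.
    """
    vecs = np.asarray(embed(questions), dtype=np.float64)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12
    kept_indices, kept_vecs = [], []
    for i, v in enumerate(vecs):
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept_indices.append(i)
            kept_vecs.append(v)
    return [questions[i] for i in kept_indices]
```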

Circularity Check

0 steps flagged

No circularity: empirical augmentation evaluated on external benchmarks

full rationale

The paper describes an empirical pipeline—rewriting existing math questions from multiple perspectives to create MetaMathQA, then fine-tuning LLaMA-2 models on the resulting dataset and measuring accuracy on the fixed external benchmarks GSM8K and MATH. No equations, fitted parameters, or self-referential quantities are presented as predictions. No load-bearing self-citations or uniqueness theorems are invoked. The central results are direct performance numbers on independent test sets after training, with no reduction of any claimed derivation to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work rests on standard supervised fine-tuning and data-augmentation assumptions.

pith-pipeline@v0.9.0 · 5570 in / 1136 out tokens · 48427 ms · 2026-05-13T10:00:51.493241+00:00 · methodology

discussion (0)


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  2. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  3. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  4. Generating Leakage-Free Benchmarks for Robust RAG Evaluation

    cs.CL 2026-05 unverdicted novelty 6.0

    SeedRG generates novel, leakage-free RAG benchmark examples from seed data by mapping reasoning structures and swapping entities while applying consistency and leakage checks.

  5. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

    cs.LG 2026-05 unverdicted novelty 6.0

    Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

  6. You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

    cs.CR 2026-05 unverdicted novelty 6.0

    NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...

  7. Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces

    cs.AI 2026-05 unverdicted novelty 6.0

    JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...

  8. Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

    cs.LG 2026-05 unverdicted novelty 6.0

    Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

  9. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  10. HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    A cooperative system with one SLM distilling stepwise hints from a large model to guide another SLM's math reasoning yields consistent accuracy gains on benchmarks.

  11. Sensitivity-Positional Co-Localization in GQA Transformers

    cs.CL 2026-04 unverdicted novelty 6.0

    In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU,...

  12. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  13. Improve Mathematical Reasoning in Language Models by Automated Process Supervision

    cs.CL 2024-06 conditional novelty 6.0

    OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.

  14. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    cs.AI 2023-12 conditional novelty 6.0

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  15. Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

    cs.LG 2026-05 unverdicted novelty 5.0

    Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

  16. Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

    cs.LG 2026-05 unverdicted novelty 5.0

    NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.

  17. NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

    cs.LG 2026-05 unverdicted novelty 5.0

    Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

  18. Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 5.0

    LoPT delivers competitive LLM post-training results by training only the top half on the task objective and using feature reconstruction to update the bottom half.

  19. Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 5.0

    LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.

  20. Post-Optimization Adaptive Rank Allocation for LoRA

    cs.AI 2026-04 unverdicted novelty 5.0

    PARA uses post-optimization SVD with a global singular-value threshold to allocate non-uniform ranks to LoRA layers, cutting parameters 75-90% with no loss in benchmark performance.

  21. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  22. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  23. Rethinking Wireless Communications through Formal Mathematical AI Reasoning

    eess.SP 2026-04 unverdicted novelty 4.0

    Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.

  24. Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation

    cs.CL 2026-04 unverdicted novelty 4.0

    AMR uses difficulty-aware routing and uncertainty-guided aggregation across three experts plus a neural verifier to reach 75.28% accuracy on GSM8K without synthetic training data.

  25. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  26. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  27. Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

    cs.CL 2025-08

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 25 Pith papers · 19 internal anchors

  1. [1]

    Alibaba. Qwen-7b. Technical Report, 2023

  2. [2]

    R. Anil, A. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. Clark, L. Shafey, Y . Huang, K. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y . Tay, K. Xiao, Y . Xu, Y . Zhang, G. Abrego, J. Ahn, J. Austin, P. Barham, J. Botha, J. Bradbury, S. Brahma, K. Brooks, M. Catast...

  3. [3]

    Z. Azerbayev, H. Schoelkopf, K. Paster, M. Dos, S. McAleer, A. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An Open Language Model For Mathematics. In International Conference on Learning Representations, 2024

  4. [4]

    Baichuan 2

    BaichuanInc. Baichuan 2. Technical Report, 2023

  5. [5]

    L. Berglund, M. Tong, M. Kaufmann, M. Balesni, A. Stickland, T. Korbak, and O. Evans. The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A”. In International Conference on Learning Representations, 2024

  6. [6]

    J. Bilmes. Submodularity In Machine Learning and Artificial Intelligence. Preprint arXiv:2202.00132, 2022

  7. [7]

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. L...

  8. [8]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-V oss, W. Gus...

  9. [9]

    W. Chen, X. Ma, X. Wang, and W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Preprint arXiv:2211.12588, 2022

  10. [10]

    Y . Chen, R. Zhong, S. Zha, G. Karypis, and H. He. Meta-learning via Language Model In-context Tuning. In Annual Meeting of the Association for Computational Linguistics, 2022

  11. [11]

    W. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. Gonzalez, I. Stoica, and E. Xing. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality. Technical Report, 2023

  12. [12]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y . Tay, N. Shazeer, V . Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H....

  13. [13]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training Verifiers to Solve Math Word Problems. Preprint arXiv:2110.14168, 2021

  14. [14]

    K. Collins, A. Jiang, S. Frieder, L. Wong, M. Zilka, U. Bhatt, T. Lukasiewicz, Y. Wu, J. Tenenbaum, W. Hart, T. Gowers, W. Li, A. Weller, and M. Jamnik. Evaluating Language Models for Mathematics through Interactions. Preprint arXiv:2306.01694, 2023

  15. [15]

    QLoRA: Efficient Finetuning of Quantized LLMs

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized llms. Preprint arXiv:2305.14314, 2023

  16. [16]

    J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics, 2019

  17. [17]

    D. Dua, Y . Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In North American Chapter of the Association for Computational Linguistics, 2019

  18. [18]

    TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

    R. Eldan and Y . Li. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? Preprint arXiv:2305.07759, 2023

  19. [19]

    Y . Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot. Specializing Smaller Language Models towards Multi-Step Reasoning. In International Conference on Machine Learning, 2023

  20. [20]

    Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot. Complexity-Based Prompting for Multi-step Reasoning. In International Conference on Learning Representations, 2023

  21. [21]

    J. Gou, B. Yu, S. Maybank, and D. Tao. Knowledge Distillation: A Survey. International Journal of Computer Vision, 2021

  22. [22]

    T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y . Yan. Knowledge Adaptation for Efficient Semantic Segmentation. In Computer Vision and Pattern Recognition, 2019

  23. [23]

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. In Neural Information Processing Systems: Datasets and Benchmarks, 2021

  24. [24]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean. Distilling the Knowledge in a Neural Network. Preprint arXiv:1503.02531, 2015

  25. [25]

    N. Ho, L. Schmid, and S. Yun. Large Language Models Are Reasoning Teachers. In Annual Meeting of the Association for Computational Linguistics, 2023

  26. [26]

    C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y . Fujii, A. Ratner, R. Krishna, C. Lee, and T. Pfister. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. In Annual Meeting of the Association for Computational Linguistics , 2023

  27. [27]

    J. Huang, S. Gu, L. Hou, Y . Wu, X. Wang, H. Yu, and J. Han. Large Language Models Can Self-Improve. Preprint arXiv:2210.11610, 2022

  28. [28]

    S. Imani, L. Du, and H. Shrivastava. MathPrompter: Mathematical Reasoning using Large Language Models. In Annual Meeting of the Association for Computational Linguistics, 2023

  29. [29]

    InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities

    InternLM. InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities. Technical Report, 2023

  30. [30]

    Mistral 7B

    A. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. Chaplot, F. Bressand, D. Casas, G. Lengyel, G. Lample, L. Saulnier, L. Lavaud, M. Lachaux, P. Stock, T. Scao, T. Lavril, T. Wang, T. Lacroix, and W. Sayed. Mistral 7B. Preprint arXiv:2310.06825, 2023

  31. [31]

    W. Jiang, B. Lin, H. Shi, Y . Zhang, Z. Li, and J. Kwok. BYOM: Building Your Own Multi-Task Model for Free. Preprint arXiv:2310.01886, 2023

  32. [32]

    W. Jiang, H. Shi, L. Yu, Z. Liu, Y . Zhang, Z. Li, and J. Kwok. Forward-Backward Reasoning in Large Language Models for Mathematical Verification. Preprint arXiv:2308.07758, 2023

  33. [33]

    W. Jiang, Y. Zhang, and J. Kwok. Effective Structured-Prompting by Meta-Learning and Representative Verbalizer. In International Conference on Machine Learning, 2023

  34. [34]

    N. Kilbertus, G. Parascandolo, and B. Schölkopf. Generalization in anti-causal learning. Preprint arXiv:1812.00524, 2018

  35. [35]

    A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y . Wu, B. Neyshabur, G. Gur-Ari, and V . Misra. Solving Quantitative Reasoning Problems with Language Models. In Neural Information Processing Systems, 2022

  36. [36]

    R. Li, L. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier, J. Monteiro, O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M. Yee, L. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. Murthy, J. Stillerman, S. Patel, D. Abulkha...

  37. [37]

    S. Li, J. Chen, Y . Shen, Z. Chen, X. Zhang, Z. Li, H. Wang, J. Qian, B. Peng, Y . Mao, W. Chen, and X. Yan. Explanations from Large Language Models Make Small Reasoners Better. Preprint arXiv:2210.06726, 2022

  38. [38]

    X. Li, Z. Zhou, J. Zhu, J. Yao, T. Liu, and B. Han. DeepInception: Hypnotize Large Language Model to be Jailbreaker. Preprint arXiv:2311.03191, 2023

  39. [39]

    H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s Verify Step by Step. In International Conference on Learning Representations, 2024

  40. [40]

    W. Liu, B. Dai, A. Humayun, C. Tay, C. Yu, L. Smith, J. Rehg, and L. Song. Iterative Machine Teaching. In International Conference on Machine Learning, 2017

  41. [41]

    W. Liu, Z. Liu, H. Wang, L. Paull, B. Schölkopf, and A. Weller. Iterative Teaching by Label Synthesis. In Neural Information Processing Systems, 2021

  42. [42]

    Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. Preprint arXiv:1907.11692, 2019

  43. [43]

    H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. Preprint arXiv:2308.09583, 2023

  44. [44]

    Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. In International Conference on Learning Representations, 2024

  45. [45]

    L. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Severyn. Teaching Small Language Models to Reason. In Annual Meeting of the Association for Computational Linguistics, 2023

  46. [46]

    M. Marion, A. Üstün, L. Pozzobon, A. Wang, M. Fadaee, and S. Hooker. When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale. Preprint arXiv:2309.04564, 2023

  47. [47]

    S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi. MetaICL: Learning to Learn In Context. In North American Chapter of the Association for Computational Linguistics, 2022

  48. [48]

    S. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh. Improved Knowledge Distillation via Teacher Assistant. In AAAI Conference on Artificial Intelligence, 2020

  49. [49]

    Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs

    MosaicML. Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs. Technical Report, 2023

  50. [50]

    Ariel N., Cole J., and Nataniel R. Platypus: Quick, Cheap, and Powerful Refinement of LLMs. Preprint arXiv:2308.07317, 2023

  51. [51]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y . Zhou, S. Savarese, and C. Xiong. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. Preprint arXiv:2203.13474, 2022

  52. [52]

    OpenAI. GPT-3.5. Technical Report, 2022

  53. [53]

    GPT-3.5-Turbo

    OpenAI. GPT-3.5-Turbo. Technical Report, 2022

  54. [54]

    OpenAI. GPT-4. Technical Report, 2023

  55. [55]

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training Language Models to Follow Instructions with Human Feedback. In Neural Information Processing Systems, 2022

  56. [56]

    W. Park, D. Kim, Y . Lu, and M. Cho. Relational Knowledge Distillation. InComputer Vision and Pattern Recognition, 2019

  57. [57]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. Preprint arXiv:2306.01116, 2023

  58. [58]

    Z. Qiu, W. Liu, T. Xiao, Z. Liu, U. Bhatt, Y. Luo, A. Weller, and B. Schölkopf. Iterative Teaching by Data Hallucination. In Artificial Intelligence and Statistics, 2023

  59. [59]

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language Models are Unsupervised Multitask Learners. Technical Report, 2019

  60. [60]

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 2020

  61. [61]

    Code Llama: Open Foundation Models for Code

    B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve. Code Llama: Open Foundation Models for Code. Preprint arXiv:2308.12950, 2023

  62. [62]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. Preprint arXiv:1707.06347, 2017

  63. [63]

    P. Shen, X. Lu, S. Li, and H. Kawai. Feature Representation of Short Utterances Based on Knowledge Distillation for Spoken Language Identification. In International Speech Communication Association, 2018

  64. [64]

    K. Shridhar, A. Stolfo, and M. Sachan. Distilling Reasoning Capabilities into Smaller Language Models. In Findings of the Association for Computational Linguistics, 2023

  65. [65]

    J. Sun, C. Zheng, E. Xie, Z. Liu, R. Chu, J. Qiu, et al. A Survey of Reasoning with Foundation Models. Preprint arXiv:2312.11562, 2023

  66. [66]

    A. Talmor, J. Herzig, N. Lourie, and J. Berant. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In North American Chapter of the Association for Computational Linguistics, 2019

  67. [67]

    R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. Hashimoto. Stanford Alpaca: An Instruction-following LLaMA Model. Technical report, 2023

  68. [68]

    Galactica: A Large Language Model for Science

    R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V . Kerkez, and R. Stojnic. Galactica: A Large Language Model for Science. Preprint arXiv:2211.09085, 2022

  69. [69]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and Efficient Foundation Language Models. Preprint arXiv:2302.13971, 2023

  70. [70]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Kore...

  71. [71]

    B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. Technical Report, 2021

  72. [72]

    P. Wang, L. Li, L. Chen, F. Song, B. Lin, Y . Cao, T. Liu, and Z. Sui. Making Large Language Models Better Reasoners with Alignment. Preprint arXiv:2309.02144, 2023

  73. [73]

    T. Wang, J. Zhu, A. Torralba, and A. Efros. Dataset Distillation. Preprint arXiv:1811.10959, 2018

  74. [74]

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In International Conference on Learning Representations, 2023

  75. [75]

    J. Wei, X. Wang, D. Schuurmans, Maarten Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Neural Information Processing Systems, 2022

  76. [76]

    Y. Weng, M. Zhu, F. Xia, B. Li, S. He, K. Liu, and J. Zhao. Large Language Models are Better Reasoners with Self-Verification. In Conference on Empirical Methods in Natural Language Processing, 2023

  77. [77]

    H. Xin, H. Wang, C. Zheng, L. Li, Z. Liu, Q. Cao, Y . Huang, J. Xiong, H. Shi, E. Xie, J. Yin, Z. Li, H. Liao, and X. Liang. Lego-Prover: Neural theorem proving with growing libraries. In International Conference on Learning Representations, 2024

  78. [78]

    J. Xiong, Z. Li, C. Zheng, Z. Guo, Y . Yin, E. Xie, Z. Yang, Q. Cao, H. Wang, X. Han, J. Tang, C. Li, and X. Liang. DQ-LoRE: Dual queries with low rank approximation re-ranking for in-context learning. In International Conference on Learning Representations, 2024

  79. [79]

    Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, and C. Zhou. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. Preprint arXiv:2308.01825, 2023

  80. [80]

    X. Yue, X. Qu, G. Zhang, Y . Fu, W. Huang, H. Sun, Y . Su, and W. Chen. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. In International Conference on Learning Representations, 2024

Showing first 80 references.