pith. machine review for the scientific record

arXiv: 2309.12284 · v4 · submitted 2023-09-21 · 💻 cs.CL · cs.AI

Recognition: no theorem link

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-13 10:00 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords mathematical reasoning · large language models · fine-tuning · data augmentation · GSM8K · MATH · LLaMA-2 · question rewriting

The pith

Rewriting existing math questions from multiple perspectives lets fine-tuned LLaMA-2 models reach 66.4 percent on GSM8K.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that taking standard math problems and rewriting each one several times from fresh angles creates a more effective training set, called MetaMathQA. Fine-tuning LLaMA-2 models on this set produces large accuracy jumps on GSM8K and MATH without adding any new external facts or problems. The 7B version hits 66.4 percent on GSM8K and 19.4 percent on MATH, beating earlier open-source models of the same size by 11.5 and 8.7 percentage points. The 70B version, at 82.3 percent, slightly exceeds GPT-3.5-Turbo on GSM8K. The approach treats the bottleneck in mathematical reasoning as insufficient variety in how problems are presented rather than insufficient raw data volume.

Core claim

By rewriting each original mathematical question from multiple distinct perspectives without introducing external knowledge, the authors create the MetaMathQA dataset that, when used to fine-tune LLaMA-2, produces models with substantially stronger mathematical reasoning capabilities, as measured by accuracy on GSM8K and MATH benchmarks.

What carries the argument

The bootstrapping process of rewriting each question from multiple perspectives to generate diverse training examples in MetaMathQA.
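
To make that process concrete, here is a minimal sketch of multi-perspective rewriting in Python. The perspective names and prompt templates are hypothetical placeholders rather than the paper's actual prompts (which, as the referee notes below, are not specified), and `generate` stands in for any text-generation call.

```python
# Hypothetical perspective templates -- illustrative only, not the paper's prompts.
PERSPECTIVES = {
    "rephrase": (
        "Rewrite the following math problem in different words, keeping the "
        "numbers and the final answer unchanged:\n{question}"
    ),
    "backward": (
        "Rewrite the following math problem so that one of the given "
        "quantities becomes the unknown and the original answer appears in "
        "the problem statement:\n{question}"
    ),
    "verify": (
        "Turn the following problem and a candidate answer into a question "
        "asking whether that answer is correct:\n{question}"
    ),
}


def bootstrap_question(question: str, generate) -> list[str]:
    """Produce one rewrite of `question` per perspective.

    `generate` is an assumed callable that sends a prompt to an LLM and
    returns the generated text; no external knowledge enters the prompt.
    """
    return [generate(tpl.format(question=question)) for tpl in PERSPECTIVES.values()]
```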

Load-bearing premise

Rewriting questions from multiple perspectives produces sufficiently diverse, high-quality, and non-redundant examples that improve actual reasoning rather than merely increasing data volume.

What would settle it

Train two models on identical numbers of examples, one using the perspective-rewriting process and one using simple duplication or random rephrasing, then compare their GSM8K and MATH scores.
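
A minimal sketch of that control, assuming a `rewriter` callable that returns perspective rewrites; fine-tuning and benchmark evaluation are treated as black boxes, and only the construction of the two size-matched training sets is shown.

```python
import random


def build_matched_training_sets(seed_questions, rewriter, n_copies=4, seed=0):
    """Build two training sets with identical example counts: one from
    perspective rewriting, one from plain duplication of the seed questions.

    `rewriter(q, k)` is an assumed callable returning k distinct rewrites of q.
    Fine-tune one model on each set, then compare GSM8K and MATH accuracy.
    """
    rng = random.Random(seed)
    rewritten, duplicated = [], []
    for q in seed_questions:
        rewritten.extend(rewriter(q, n_copies))   # treatment arm
        duplicated.extend([q] * n_copies)         # volume-matched control arm
    assert len(rewritten) == len(duplicated)      # data volume held fixed
    rng.shuffle(rewritten)
    rng.shuffle(duplicated)
    return rewritten, duplicated
```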

read the original abstract

Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (e.g., LLaMA-2) are still far away from satisfactory for solving mathematical problem due to the complex reasoning procedures. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Particularly, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release all the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MetaMath, a fine-tuning approach for LLaMA-2 models that first bootstraps a new dataset (MetaMathQA) by rewriting existing mathematical questions from multiple perspectives without introducing external knowledge, then trains on this augmented data. It reports large gains on GSM8K (66.4% for 7B, 82.3% for 70B) and MATH (19.4% for 7B), exceeding prior open-source models of comparable size by 11.5 percentage points on GSM8K and 8.7 on MATH at the 7B scale.

Significance. If the performance lift is shown to arise from the diversity and quality of the multi-perspective rewrites rather than from simply increasing training volume, the method supplies a low-cost, knowledge-free data-augmentation recipe that could be applied to other reasoning domains and would materially narrow the gap between open-source and closed-source mathematical reasoning models.

major comments (3)
  1. [Section 3] Section 3 (MetaMathQA Construction): the rewriting procedure is described only at a high level; the paper does not specify the exact prompts, the number of rewrites per seed question, or any automated filters for mathematical validity or non-redundancy, preventing independent reproduction of the claimed data quality.
  2. [Section 4.2] Section 4.2 and Table 2: no ablation holds total training tokens or example count fixed while varying the rewrite strategy (e.g., MetaMathQA vs. duplicated original GSM8K/MATH vs. random paraphrases). Without this control, the 11.5% GSM8K and 8.7% MATH gains cannot be attributed to multi-perspective rewriting rather than increased data volume.
  3. [Section 4.3] Section 4.3: the comparison tables report single-run accuracies without error bars, multiple random seeds, or statistical significance tests, which is especially problematic when claiming large margins over prior SOTA models of the same size.
minor comments (2)
  1. [Figure 1] Figure 1 caption and surrounding text use inconsistent terminology ('forward' vs. 'backward' rewriting) that is never formally defined.
  2. [Abstract] The abstract states that MetaMath-70B is 'slightly better than GPT-3.5-Turbo' on GSM8K, but the main text does not report the exact GPT-3.5-Turbo score used for this comparison.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of reproducibility, experimental controls, and statistical reporting that will strengthen the manuscript. We address each major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (MetaMathQA Construction): the rewriting procedure is described only at a high level; the paper does not specify the exact prompts, the number of rewrites per seed question, or any automated filters for mathematical validity or non-redundancy, preventing independent reproduction of the claimed data quality.

    Authors: We agree that additional details are required for full reproducibility. In the revised manuscript we will add the exact prompts used for multi-perspective rewriting in an appendix. We generate four rewrites per seed question. For quality control we apply an automated filter that solves both the original and rewritten questions with a symbolic solver and discards any pair whose answers differ; we also remove near-duplicates via embedding similarity. These steps will be described in detail. revision: yes

  2. Referee: [Section 4.2] Section 4.2 and Table 2: no ablation holds total training tokens or example count fixed while varying the rewrite strategy (e.g., MetaMathQA vs. duplicated original GSM8K/MATH vs. random paraphrases). Without this control, the 11.5% GSM8K and 8.7% MATH gains cannot be attributed to multi-perspective rewriting rather than increased data volume.

    Authors: We acknowledge that a volume-controlled ablation is necessary to isolate the contribution of multi-perspective rewriting. We will add this experiment in the revised version: we train on (i) the original GSM8K/MATH data duplicated to match the example count of MetaMathQA, (ii) random paraphrases generated with the same model and prompt style but without the multi-perspective instruction, and (iii) MetaMathQA itself, keeping total training tokens fixed. Results will be reported in an updated Table 2. revision: yes

  3. Referee: [Section 4.3] Section 4.3: the comparison tables report single-run accuracies without error bars, multiple random seeds, or statistical significance tests, which is especially problematic when claiming large margins over prior SOTA models of the same size.

    Authors: We agree that reporting variance would increase confidence in the results. Due to compute limits we performed single runs for the 70B model; for the 7B model we will rerun with three random seeds and report mean and standard deviation. We will also add a brief discussion noting that the observed margins (11.5% and 8.7%) substantially exceed typical run-to-run variance observed in similar fine-tuning settings. These changes will appear in Section 4.3 and the tables. revision: partial
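
The quality-control step described in response 1 (keep a rewrite only if it resolves to the same answer as its seed, then drop near-duplicates by embedding similarity) could look roughly like the sketch below. The solver output and embedding model are stand-ins for whatever the authors actually use, and the 0.95 threshold is an arbitrary illustrative value.

```python
import numpy as np


def answers_match(seed_answer: str, rewrite_answer: str, tol: float = 1e-6) -> bool:
    """Discard a rewrite whose solved answer disagrees with its seed's answer."""
    try:
        return abs(float(seed_answer) - float(rewrite_answer)) <= tol
    except ValueError:
        return seed_answer.strip() == rewrite_answer.strip()


def drop_near_duplicates(questions, embed, threshold=0.95):
    """Greedy near-duplicate removal by cosine similarity of embeddings.

    `embed` is an assumed callable mapping a list of strings to an (n, d)
    array; any off-the-shelf sentence-embedding model would fit here.
    """
    vecs = np.asarray(embed(questions), dtype=np.float64)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12
    kept_indices, kept_vecs = [], []
    for i, v in enumerate(vecs):
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept_indices.append(i)
            kept_vecs.append(v)
    return [questions[i] for i in kept_indices]
```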

Circularity Check

0 steps flagged

No circularity: empirical augmentation evaluated on external benchmarks

full rationale

The paper describes an empirical pipeline—rewriting existing math questions from multiple perspectives to create MetaMathQA, then fine-tuning LLaMA-2 models on the resulting dataset and measuring accuracy on the fixed external benchmarks GSM8K and MATH. No equations, fitted parameters, or self-referential quantities are presented as predictions. No load-bearing self-citations or uniqueness theorems are invoked. The central results are direct performance numbers on independent test sets after training, with no reduction of any claimed derivation to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work rests on standard supervised fine-tuning and data-augmentation assumptions.

pith-pipeline@v0.9.0 · 5570 in / 1136 out tokens · 48427 ms · 2026-05-13T10:00:51.493241+00:00 · methodology

discussion (0)


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  2. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  3. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  4. Generating Leakage-Free Benchmarks for Robust RAG Evaluation

    cs.CL 2026-05 unverdicted novelty 6.0

    SeedRG generates novel, leakage-free RAG benchmark examples from seed data by mapping reasoning structures and swapping entities while applying consistency and leakage checks.

  5. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

    cs.LG 2026-05 unverdicted novelty 6.0

    Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

  6. You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

    cs.CR 2026-05 unverdicted novelty 6.0

    NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...

  7. Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces

    cs.AI 2026-05 unverdicted novelty 6.0

    JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...

  8. Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

    cs.LG 2026-05 unverdicted novelty 6.0

    Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

  9. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  10. HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    A cooperative system with one SLM distilling stepwise hints from a large model to guide another SLM's math reasoning yields consistent accuracy gains on benchmarks.

  11. Sensitivity-Positional Co-Localization in GQA Transformers

    cs.CL 2026-04 unverdicted novelty 6.0

    In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU,...

  12. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  13. Improve Mathematical Reasoning in Language Models by Automated Process Supervision

    cs.CL 2024-06 conditional novelty 6.0

    OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.

  14. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    cs.AI 2023-12 conditional novelty 6.0

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  15. Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

    cs.LG 2026-05 unverdicted novelty 5.0

    Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

  16. Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

    cs.LG 2026-05 unverdicted novelty 5.0

    NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.

  17. NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

    cs.LG 2026-05 unverdicted novelty 5.0

    Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

  18. Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 5.0

    LoPT delivers competitive LLM post-training results by training only the top half on the task objective and using feature reconstruction to update the bottom half.

  19. Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 5.0

    LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.

  20. Post-Optimization Adaptive Rank Allocation for LoRA

    cs.AI 2026-04 unverdicted novelty 5.0

    PARA uses post-optimization SVD with a global singular-value threshold to allocate non-uniform ranks to LoRA layers, cutting parameters 75-90% with no loss in benchmark performance.

  21. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  22. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  23. Rethinking Wireless Communications through Formal Mathematical AI Reasoning

    eess.SP 2026-04 unverdicted novelty 4.0

    Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.

  24. Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation

    cs.CL 2026-04 unverdicted novelty 4.0

    AMR uses difficulty-aware routing and uncertainty-guided aggregation across three experts plus a neural verifier to reach 75.28% accuracy on GSM8K without synthetic training data.

  25. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  26. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  27. Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

    cs.CL 2025-08

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 25 Pith papers · 19 internal anchors

  1. [1]

    Alibaba. Qwen-7b. Technical Report, 2023

  2. [2]

    R. Anil, A. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. Clark, L. Shafey, Y . Huang, K. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y . Tay, K. Xiao, Y . Xu, Y . Zhang, G. Abrego, J. Ahn, J. Austin, P. Barham, J. Botha, J. Bradbury, S. Brahma, K. Brooks, M. Catast...

  3. [3]

    Z. Azerbayev, H. Schoelkopf, K. Paster, M. Dos, S. McAleer, A. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An Open Language Model For Mathematics. In International Conference on Learning Representations, 2024

  4. [4]

    Baichuan 2

    BaichuanInc. Baichuan 2. Technical Report, 2023

  5. [5]

    L. Berglund, M. Tong, M. Kaufmann, M. Balesni, A. Stickland, T. Korbak, and O. Evans. The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A”. In International Conference on Learning Representations, 2024

  6. [6]

    J. Bilmes. Submodularity In Machine Learning and Artificial Intelligence. Preprint arXiv:2202.00132, 2022

  7. [7]

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. L...

  8. [8]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-V oss, W. Gus...

  9. [9]

    W. Chen, X. Ma, X. Wang, and W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Preprint arXiv:2211.12588, 2022

  10. [10]

    Y . Chen, R. Zhong, S. Zha, G. Karypis, and H. He. Meta-learning via Language Model In-context Tuning. In Annual Meeting of the Association for Computational Linguistics, 2022

  11. [11]

    W. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. Gonzalez, I. Stoica, and E. Xing. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality. Technical Report, 2023

  12. [12]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y . Tay, N. Shazeer, V . Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H....

  13. [13]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training Verifiers to Solve Math Word Problems. Preprint arXiv:2110.14168, 2021

  14. [14]

    K. Collins, A. Jiang, S. Frieder, L. Wong, M. Zilka, U. Bhatt, T. Lukasiewicz, Y. Wu, J. Tenenbaum, W. Hart, T. Gowers, W. Li, A. Weller, and M. Jamnik. Evaluating Language Models for Mathematics through Interactions. Preprint arXiv:2306.01694, 2023

  15. [15]

    QLoRA: Efficient Finetuning of Quantized LLMs

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized llms. Preprint arXiv:2305.14314, 2023

  16. [16]

    J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics, 2019

  17. [17]

    D. Dua, Y . Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In North American Chapter of the Association for Computational Linguistics, 2019

  18. [18]

    TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

    R. Eldan and Y . Li. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? Preprint arXiv:2305.07759, 2023

  19. [19]

    Y . Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot. Specializing Smaller Language Models towards Multi-Step Reasoning. In International Conference on Machine Learning, 2023

  20. [20]

    Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot. Complexity-Based Prompting for Multi-step Reasoning. In International Conference on Learning Representations, 2023

  21. [21]

    J. Gou, B. Yu, S. Maybank, and D. Tao. Knowledge Distillation: A Survey. International Journal of Computer Vision, 2021

  22. [22]

    T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y . Yan. Knowledge Adaptation for Efficient Semantic Segmentation. In Computer Vision and Pattern Recognition, 2019

  23. [23]

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. In Neural Information Processing Systems: Datasets and Benchmarks, 2021

  24. [24]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean. Distilling the Knowledge in a Neural Network. Preprint arXiv:1503.02531, 2015

  25. [25]

    N. Ho, L. Schmid, and S. Yun. Large Language Models Are Reasoning Teachers. In Annual Meeting of the Association for Computational Linguistics, 2023

  26. [26]

    C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y . Fujii, A. Ratner, R. Krishna, C. Lee, and T. Pfister. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. In Annual Meeting of the Association for Computational Linguistics , 2023

  27. [27]

    J. Huang, S. Gu, L. Hou, Y . Wu, X. Wang, H. Yu, and J. Han. Large Language Models Can Self-Improve. Preprint arXiv:2210.11610, 2022

  28. [28]

    S. Imani, L. Du, and H. Shrivastava. MathPrompter: Mathematical Reasoning using Large Language Models. In Annual Meeting of the Association for Computational Linguistics, 2023

  29. [29]

    InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities

    InternLM. InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities. Technical Report, 2023

  30. [30]

    Mistral 7B

    A. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. Chaplot, F. Bressand, D. Casas, G. Lengyel, G. Lample, L. Saulnier, L. Lavaud, M. Lachaux, P. Stock, T. Scao, T. Lavril, T. Wang, T. Lacroix, and W. Sayed. Mistral 7B. Preprint arXiv:2310.06825, 2023

  31. [31]

    W. Jiang, B. Lin, H. Shi, Y . Zhang, Z. Li, and J. Kwok. BYOM: Building Your Own Multi-Task Model for Free. Preprint arXiv:2310.01886, 2023

  32. [32]

    W. Jiang, H. Shi, L. Yu, Z. Liu, Y . Zhang, Z. Li, and J. Kwok. Forward-Backward Reasoning in Large Language Models for Mathematical Verification. Preprint arXiv:2308.07758, 2023

  33. [33]

    W. Jiang, Y. Zhang, and J. Kwok. Effective Structured-Prompting by Meta-Learning and Representative Verbalizer. In International Conference on Machine Learning, 2023

  34. [34]

    N. Kilbertus, G. Parascandolo, and B. Schölkopf. Generalization in anti-causal learning. Preprint arXiv:1812.00524, 2018

  35. [35]

    A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y . Wu, B. Neyshabur, G. Gur-Ari, and V . Misra. Solving Quantitative Reasoning Problems with Language Models. In Neural Information Processing Systems, 2022

  36. [36]

    R. Li, L. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier, J. Monteiro, O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M. Yee, L. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. Murthy, J. Stillerman, S. Patel, D. Abulkha...

  37. [37]

    S. Li, J. Chen, Y . Shen, Z. Chen, X. Zhang, Z. Li, H. Wang, J. Qian, B. Peng, Y . Mao, W. Chen, and X. Yan. Explanations from Large Language Models Make Small Reasoners Better. Preprint arXiv:2210.06726, 2022

  38. [38]

    X. Li, Z. Zhou, J. Zhu, J. Yao, T. Liu, and B. Han. DeepInception: Hypnotize Large Language Model to be Jailbreaker. Preprint arXiv:2311.03191, 2023

  39. [39]

    H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s Verify Step by Step. In International Conference on Learning Representations, 2024

  40. [40]

    W. Liu, B. Dai, A. Humayun, C. Tay, C. Yu, L. Smith, J. Rehg, and L. Song. Iterative Machine Teaching. In International Conference on Machine Learning, 2017

  41. [41]

    W. Liu, Z. Liu, H. Wang, L. Paull, B. Schölkopf, and A. Weller. Iterative Teaching by Label Synthesis. In Neural Information Processing Systems, 2021

  42. [42]

    Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. Preprint arXiv:1907.11692, 2019

  43. [43]

    H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. Preprint arXiv:2308.09583, 2023

  44. [44]

    Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. In International Conference on Learning Representations, 2024

  45. [45]

    L. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Severyn. Teaching Small Language Models to Reason. In Annual Meeting of the Association for Computational Linguistics, 2023

  46. [46]

    M. Marion, A. Üstün, L. Pozzobon, A. Wang, M. Fadaee, and S. Hooker. When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale. Preprint arXiv:2309.04564, 2023

  47. [47]

    S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi. MetaICL: Learning to Learn In Context. In North American Chapter of the Association for Computational Linguistics, 2022

  48. [48]

    S. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh. Improved Knowledge Distillation via Teacher Assistant. In AAAI Conference on Artificial Intelligence, 2020

  49. [49]

    Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs

    MosaicML. Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs. Technical Report, 2023

  50. [50]

    Ariel N., Cole J., and Nataniel R. Platypus: Quick, Cheap, and Powerful Refinement of LLMs. Preprint arXiv:2308.07317, 2023

  51. [51]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y . Zhou, S. Savarese, and C. Xiong. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. Preprint arXiv:2203.13474, 2022

  52. [52]

    OpenAI. GPT-3.5. Technical Report, 2022

  53. [53]

    GPT-3.5-Turbo

    OpenAI. GPT-3.5-Turbo. Technical Report, 2022

  54. [54]

    OpenAI. GPT-4. Technical Report, 2023

  55. [55]

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training Language Models to Follow Instructions with Human Feedback. In Neural Information Processing Systems, 2022

  56. [56]

    W. Park, D. Kim, Y . Lu, and M. Cho. Relational Knowledge Distillation. InComputer Vision and Pattern Recognition, 2019

  57. [57]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. Preprint arXiv:2306.01116, 2023

  58. [58]

    Z. Qiu, W. Liu, T. Xiao, Z. Liu, U. Bhatt, Y. Luo, A. Weller, and B. Schölkopf. Iterative Teaching by Data Hallucination. In Artificial Intelligence and Statistics, 2023

  59. [59]

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language Models are Unsupervised Multitask Learners. Technical Report, 2019

  60. [60]

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 2020

  61. [61]

    Code Llama: Open Foundation Models for Code

    B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve. Code Llama: Open Foundation Models for Code. Preprint arXiv:2308.12950, 2023

  62. [62]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. Preprint arXiv:1707.06347, 2017

  63. [63]

    P. Shen, X. Lu, S. Li, and H. Kawai. Feature Representation of Short Utterances Based on Knowledge Distillation for Spoken Language Identification. In International Speech Communication Association, 2018

  64. [64]

    K. Shridhar, A. Stolfo, and M. Sachan. Distilling Reasoning Capabilities into Smaller Language Models. In Findings of the Association for Computational Linguistics, 2023

  65. [65]

    J. Sun, C. Zheng, E. Xie, Z. Liu, R. Chu, J. Qiu, et al. A Survey of Reasoning with Foundation Models. Preprint arXiv:2312.11562, 2023

  66. [66]

    A. Talmor, J. Herzig, N. Lourie, and J. Berant. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In North American Chapter of the Association for Computational Linguistics, 2019

  67. [67]

    R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. Hashimoto. Stanford Alpaca: An Instruction-following LLaMA Model. Technical report, 2023

  68. [68]

    Galactica: A Large Language Model for Science

    R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V . Kerkez, and R. Stojnic. Galactica: A Large Language Model for Science. Preprint arXiv:2211.09085, 2022

  69. [69]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and Efficient Foundation Language Models. Preprint arXiv:2302.13971, 2023

  70. [70]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Kore...

  71. [71]

    B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. Technical Report, 2021

  72. [72]

    P. Wang, L. Li, L. Chen, F. Song, B. Lin, Y . Cao, T. Liu, and Z. Sui. Making Large Language Models Better Reasoners with Alignment. Preprint arXiv:2309.02144, 2023

  73. [73]

    T. Wang, J. Zhu, A. Torralba, and A. Efros. Dataset Distillation. Preprint arXiv:1811.10959, 2018

  74. [74]

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In International Conference on Learning Representations, 2023

  75. [75]

    J. Wei, X. Wang, D. Schuurmans, Maarten Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Neural Information Processing Systems, 2022

  76. [76]

    Y. Weng, M. Zhu, F. Xia, B. Li, S. He, K. Liu, and J. Zhao. Large Language Models are Better Reasoners with Self-Verification. In Conference on Empirical Methods in Natural Language Processing, 2023

  77. [77]

    H. Xin, H. Wang, C. Zheng, L. Li, Z. Liu, Q. Cao, Y . Huang, J. Xiong, H. Shi, E. Xie, J. Yin, Z. Li, H. Liao, and X. Liang. Lego-Prover: Neural theorem proving with growing libraries. In International Conference on Learning Representations, 2024

  78. [78]

    J. Xiong, Z. Li, C. Zheng, Z. Guo, Y . Yin, E. Xie, Z. Yang, Q. Cao, H. Wang, X. Han, J. Tang, C. Li, and X. Liang. DQ-LoRE: Dual queries with low rank approximation re-ranking for in-context learning. In International Conference on Learning Representations, 2024

  79. [79]

    Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, and C. Zhou. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. Preprint arXiv:2308.01825, 2023

  80. [80]

    X. Yue, X. Qu, G. Zhang, Y . Fu, W. Huang, H. Sun, Y . Su, and W. Chen. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. In International Conference on Learning Representations, 2024

Showing first 80 references.