pith. machine review for the scientific record.

arxiv: 2308.01825 · v2 · submitted 2023-08-03 · 💻 cs.CL

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

Pith reviewed 2026-05-15 00:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords mathematical reasoning · large language models · scaling laws · rejection sampling · fine-tuning · GSM8K · pre-training loss · data augmentation

The pith

Pre-training loss predicts LLM mathematical reasoning performance better than parameter count, and rejection sampling fine-tuning lifts LLaMA-7B to 49.3 percent accuracy on GSM8K.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how pre-training loss, the quantity of supervised data, and the quantity of augmented data shape the mathematical reasoning ability of supervised large language models. It reports that pre-training loss predicts final accuracy better than model size does. Accuracy grows log-linearly with the volume of supervised data, yet stronger models gain less from each additional example. The authors introduce rejection sampling fine-tuning, which samples reasoning paths from the supervised model itself and retains only the correct ones; the gains are larger for weaker models and when the retained paths are more diverse.

Core claim

The authors show that mathematical reasoning accuracy scales log-linearly with the volume of supervised fine-tuning data and that this scaling is steeper for models with higher pre-training loss. They further show that rejection sampling fine-tuning, which collects verified correct reasoning paths generated by the supervised models and uses them as additional training data, improves accuracy beyond standard supervised fine-tuning, with the largest gains occurring when samples from multiple models are pooled.
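The pooling step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `equation_key` and `pool_distinct_paths`, the input layout, and the rule that two paths count as the same when they contain the same sequence of simple `a op b = c` equations are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical pattern for simple "a op b = c" equations inside a reasoning path.
EQUATION = re.compile(r"\d+(?:\.\d+)?\s*[-+*/]\s*\d+(?:\.\d+)?\s*=\s*\d+(?:\.\d+)?")

def equation_key(path: str) -> tuple:
    """Return the sequence of equations in a path, with whitespace removed."""
    return tuple(re.sub(r"\s+", "", eq) for eq in EQUATION.findall(path))

def pool_distinct_paths(per_model: dict) -> dict:
    """Merge accepted paths from several models, keeping one path per equation key.

    per_model maps model name -> {question id -> list of accepted paths}.
    """
    pooled = defaultdict(list)
    seen = defaultdict(set)
    for paths_by_question in per_model.values():
        for qid, paths in paths_by_question.items():
            for path in paths:
                key = equation_key(path)
                if key not in seen[qid]:
                    seen[qid].add(key)
                    pooled[qid].append(path)
    return dict(pooled)

# Toy inputs, invented for illustration only.
samples = {
    "model_a": {"q1": ["3 + 4 = 7, then 7 * 2 = 14. #### 14"]},
    "model_b": {"q1": ["3+4=7 and 7*2=14. #### 14",
                       "4 * 2 = 8, 3 * 2 = 6, 8 + 6 = 14. #### 14"]},
}
pooled = pool_distinct_paths(samples)
print(len(pooled["q1"]))
```

In this toy input the first path from `model_b` repeats the equations of the `model_a` path and is dropped, so pooling adds only the path that takes a different route to the answer.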

What carries the argument

Rejection sampling fine-tuning (RFT), which generates candidate reasoning paths from the model, retains only those paths verified as correct, and fine-tunes on the retained set.
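The generate-and-retain step can be sketched in a few lines. This is an illustrative outline under stated assumptions rather than the paper's implementation: `collect_rft_paths`, `final_answer`, and `fake_sampler` are hypothetical names, the sampler is a canned stand-in for a fine-tuned model, and the example assumes a GSM8K-style `#### answer` suffix for the final answer.

```python
from typing import Callable, Dict, List

def final_answer(path: str) -> str:
    """Take the text after the last '####' marker as the final answer (assumed format)."""
    return path.rsplit("####", 1)[-1].strip() if "####" in path else ""

def collect_rft_paths(
    questions: Dict[str, str],
    gold: Dict[str, str],
    sample: Callable[[str, int], List[str]],
    k: int = 4,
) -> Dict[str, List[str]]:
    """Sample k paths per question; keep those whose final answer equals the gold answer."""
    kept = {}
    for qid, question in questions.items():
        accepted = []
        for path in sample(question, k):
            # Reject wrong answers and skip verbatim repeats of an accepted path.
            if final_answer(path) == gold[qid] and path not in accepted:
                accepted.append(path)
        kept[qid] = accepted
    return kept

# Stand-in for a fine-tuned model's sampler; a real one would decode stochastically.
def fake_sampler(question: str, k: int) -> List[str]:
    canned = ["2 + 3 = 5 #### 5", "2 * 3 = 6 #### 6",
              "3 + 2 = 5 #### 5", "2 + 3 = 5 #### 5"]
    return canned[:k]

rft_data = collect_rft_paths({"q1": "What is 2 plus 3?"}, {"q1": "5"}, fake_sampler)
print(rft_data)
```

The retained paths would then be added to the supervised set for another round of fine-tuning; that training step is outside the scope of this sketch.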

Load-bearing premise

Reasoning paths whose final answer matches the ground truth are themselves correct often enough that the filter admits no systematic false positives, such as flawed derivations that reach the right answer by accident.

What would settle it

Replace the answer-matching filter with random acceptance at the same rate and measure whether the reported accuracy gains on GSM8K disappear.
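A rate-matched control of this kind could be built roughly as below. The function name `random_acceptance_control` and the `(path, is_correct)` input layout are assumptions for illustration; `is_correct` stands for whatever decision the filter made on each candidate.

```python
import random
from typing import List, Tuple

def random_acceptance_control(
    candidates: List[Tuple[str, bool]], seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Return (filtered, control): filter-accepted paths and an equally sized random draw."""
    filtered = [path for path, is_correct in candidates if is_correct]
    rng = random.Random(seed)
    # Draw without replacement from all candidates, ignoring the filter's decision.
    control = rng.sample([path for path, _ in candidates], len(filtered))
    return filtered, control

# Toy candidates, invented for illustration: 5 of 20 paths pass the filter.
candidates = [(f"path_{i}", i % 4 == 0) for i in range(20)]
filtered, control = random_acceptance_control(candidates)
print(len(filtered), len(control))
```

Fine-tuning once on `filtered` and once on `control` keeps the amount of augmented data equal, so any remaining accuracy gap can be attributed to the filter rather than to data volume.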

Original abstract

Mathematical reasoning is a challenging task for large language models (LLMs), while the scaling relationship of it with respect to LLM capacity is under-explored. In this paper, we investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM. We find that pre-training loss is a better indicator of the model's performance than the model's parameter count. We apply supervised fine-tuning (SFT) with different amounts of supervised data and empirically find a log-linear relation between data amount and model performance, and we find better models improve less with enlarged supervised datasets. To augment more data samples for improving model performances without any human effort, we propose to apply Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as augmented fine-tuning datasets. We find with augmented samples containing more distinct reasoning paths, RFT improves mathematical reasoning performance more for LLMs. We also find RFT brings more improvement for less performant LLMs. Furthermore, we combine rejection samples from multiple models which push LLaMA-7B to an accuracy of 49.3% on GSM8K which outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper investigates scaling relationships for mathematical reasoning in LLMs. It reports that pre-training loss correlates more strongly with downstream performance than model parameter count, identifies a log-linear relationship between the volume of supervised fine-tuning data and accuracy on math benchmarks, and introduces Rejection sampling Fine-Tuning (RFT) that augments training sets with model-generated reasoning paths whose final answers match ground truth. Combining RFT samples across multiple models raises LLaMA-7B accuracy on GSM8K from 35.9% (SFT) to 49.3%.

Significance. If the empirical trends hold under broader controls, the work supplies practical, annotation-free methods for boosting LLM math reasoning and clarifies which pre-training metrics best forecast downstream capability. The demonstration that RFT yields larger gains for weaker base models and that distinct reasoning paths matter more than sheer volume is a concrete, reproducible contribution to efficient fine-tuning.

major comments (2)
  1. [Section 4] Section 4 (RFT experiments): the headline 49.3% GSM8K accuracy for LLaMA-7B is reported without error bars, number of random seeds, or variance across runs; the 13.4-point gain over the 35.9% SFT baseline therefore cannot yet be assessed for statistical reliability.
  2. [Section 3.2] Section 3.2 (scaling with supervised data): the claimed log-linear relation is shown visually but lacks the fitted slope, intercept, or R² value; without these statistics it is impossible to judge how well the functional form actually describes the observed points or whether the “better models improve less” interaction is significant.
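The statistics requested in the second major comment can be obtained with an ordinary least-squares fit of accuracy against the logarithm of the training-set size. The sketch below is one way to compute them, assuming the functional form accuracy = intercept + slope · ln(n); the helper name `fit_log_linear` is hypothetical and the data points are synthetic, not taken from the paper.

```python
import math
from typing import List, Tuple

def fit_log_linear(n_examples: List[float], accuracy: List[float]) -> Tuple[float, float, float]:
    """Least-squares fit of accuracy = intercept + slope * ln(n).

    Returns (slope, intercept, r_squared).
    """
    x = [math.log(n) for n in n_examples]
    mean_x = sum(x) / len(x)
    mean_y = sum(accuracy) / len(accuracy)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, accuracy))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, accuracy))
    ss_tot = sum((yi - mean_y) ** 2 for yi in accuracy)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic points generated from accuracy = 5 + 4 * ln(n); not results from the paper.
sizes = [500, 1000, 2000, 4000, 7000]
acc = [5 + 4 * math.log(n) for n in sizes]
slope, intercept, r2 = fit_log_linear(sizes, acc)
print(round(slope, 3), round(intercept, 3), round(r2, 3))
```

Fitting each model separately and comparing slopes would give a first, informal check of the "better models improve less" interaction; a formal significance test would need the per-run variance as well.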
minor comments (3)
  1. [Abstract / Section 3.1] The abstract and Section 3.1 should explicitly state the exact pre-training loss metric (e.g., perplexity on which corpus) used to rank models, so readers can replicate the “better indicator than parameter count” comparison.
  2. [Table 1] Table 1 or the corresponding results table should report the number of distinct reasoning paths retained after rejection sampling for each model size; this quantity is central to the claim that “augmented samples containing more distinct reasoning paths” drive the gains.
  3. [Figures 2-4] Figure captions for the scaling plots should include the exact GSM8K test-set size and whether accuracy is computed with exact-match final-answer verification only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the valuable suggestions. We will address the major comments by providing additional statistical details in the revised manuscript.

Point-by-point responses
  1. Referee: [Section 4] Section 4 (RFT experiments): the headline 49.3% GSM8K accuracy for LLaMA-7B is reported without error bars, number of random seeds, or variance across runs; the 13.4-point gain over the 35.9% SFT baseline therefore cannot yet be assessed for statistical reliability.

    Authors: We agree with the referee that reporting error bars and details on random seeds is necessary to assess statistical reliability. In the revised version, we will include results averaged over multiple random seeds (specifically, we will report means and standard deviations from 3 independent runs) and add error bars to the relevant figures and tables in Section 4.
    revision: yes

  2. Referee: [Section 3.2] Section 3.2 (scaling with supervised data): the claimed log-linear relation is shown visually but lacks the fitted slope, intercept, or R² value; without these statistics it is impossible to judge how well the functional form actually describes the observed points or whether the “better models improve less” interaction is significant.

    Authors: We appreciate this feedback. We will augment Section 3.2 with the fitted parameters (slope and intercept) and the R² value for the log-linear relationship. We will also include a statistical analysis of the interaction effect to evaluate the significance of the finding that better models improve less with additional supervised data.
    revision: yes
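The seed-level reporting promised in the first response amounts to a mean and a sample standard deviation per setting. A minimal sketch using Python's standard `statistics` module follows; the helper name `summarize_runs`, the setting names, and the accuracies are placeholders invented for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List

def summarize_runs(runs: Dict[str, List[float]]) -> Dict[str, str]:
    """Format each setting as 'mean +/- sample standard deviation' over its seeds."""
    return {name: f"{mean(vals):.1f} +/- {stdev(vals):.1f}" for name, vals in runs.items()}

# Placeholder accuracies for three seeds; invented numbers, not the paper's results.
placeholder = {"setting_a": [40.0, 41.0, 42.0], "setting_b": [50.0, 50.5, 51.0]}
summary = summarize_runs(placeholder)
print(summary)
```

With only three runs the standard deviation is itself a rough estimate, so it is best read as an indication of run-to-run spread rather than a tight confidence interval.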

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports purely empirical results: measured pre-training loss as a predictor of downstream GSM8K accuracy, observed log-linear scaling of performance with supervised data volume, and accuracy gains from RFT (rejection sampling of paths whose final answer matches ground truth). All claims rest on direct experimental comparisons against SFT baselines and external benchmarks; no derivation, equation, or first-principles argument is offered that reduces to its own inputs by construction. No self-citation is used to justify a uniqueness theorem or ansatz, and the rejection filter relies on exact answer matching rather than model-generated verification. The work is therefore self-contained against external data and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is almost entirely empirical. The main unstated premise is that automatic verification of generated reasoning paths is accurate enough to serve as ground truth for further training.

axioms (1)
  • domain assumption: Model-generated reasoning paths can be filtered for correctness without systematic bias or false acceptance.
    The RFT procedure depends on this filter to produce usable augmented data.

pith-pipeline@v0.9.0 · 5537 in / 1271 out tokens · 53407 ms · 2026-05-15T00:16:47.344407+00:00 · methodology

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  2. Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.

  3. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  4. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  5. Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

    cs.LG 2026-05 unverdicted novelty 7.0

    Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...

  6. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  7. Step Rejection Fine-Tuning: A Practical Distillation Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Step Rejection Fine-Tuning masks loss on erroneous steps identified by a critic LLM in unresolved trajectories, raising SWE-bench Verified resolution rate by 3.7% to 32.2% versus 2.4% for trajectory-level rejection.

  8. CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

    cs.AI 2026-05 unverdicted novelty 6.0

    CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.

  9. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...

  10. $S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data

    cs.LG 2026-05 unverdicted novelty 6.0

    S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.

  11. Distillation Traps and Guards: A Calibration Knob for LLM Distillability

    cs.LG 2026-04 unverdicted novelty 6.0

    Reinforcement fine-tuning calibration makes LLM distillability adjustable, allowing optimized knowledge transfer or model IP safeguards via a combined task-KL-calibration objective.

  12. Agentic Frameworks for Reasoning Tasks: An Empirical Study

    cs.AI 2026-04 unverdicted novelty 6.0

    An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

  13. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

  14. AIRA_2: Overcoming Bottlenecks in AI Research Agents

    cs.AI 2026-03 conditional novelty 6.0

    AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20...

  15. SAM 3D: 3Dfy Anything in Images

    cs.CV 2025-11 unverdicted novelty 6.0

    SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

  16. Search-o1: Agentic Search-Enhanced Large Reasoning Models

    cs.AI 2025-01 unverdicted novelty 6.0

    Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...

  17. StarCoder 2 and The Stack v2: The Next Generation

    cs.SE 2024-02 accept novelty 6.0

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

  18. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    cs.AI 2023-12 conditional novelty 6.0

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  19. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    cs.CL 2023-09 conditional novelty 6.0

    Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

  20. Language as a Latent Variable for Reasoning Optimization

    cs.CL 2026-04 unverdicted novelty 5.0

    Treating language as a latent variable via polyGRPO RL improves Qwen2.5-7B-Instruct by 6.72% on English reasoning benchmarks and 6.89% on multilingual ones, with cross-task gains on commonsense reasoning from math-onl...

  21. H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.

  22. PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor

    cs.AI 2026-04 unverdicted novelty 5.0

    PsychAgent combines memory-augmented planning, trajectory-based skill evolution, and rejection fine-tuning to create a self-improving AI psychological counselor that outperforms general LLMs in multi-session evaluations.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · cited by 20 Pith papers · 9 internal anchors

  1. [2]

    Emergent Abilities of Large Language Models , author=. Trans. Mach. Learn. Res. , year=

  2. [3]

    ArXiv , year=

    Finetuned Language Models Are Zero-Shot Learners , author=. ArXiv , year=

  3. [4]

    ArXiv , year=

    Chain of Thought Prompting Elicits Reasoning in Large Language Models , author=. ArXiv , year=

  4. [5]

    The Eleventh International Conference on Learning Representations , year=

    Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. The Eleventh International Conference on Learning Representations , year=

  5. [6]

    2023 , eprint=

    Scaling Data-Constrained Language Models , author=. 2023 , eprint=

  6. [8]

    2021 , eprint=

    Scaling Laws for Transfer , author=. 2021 , eprint=

  7. [9]

    2022 , eprint=

    Training Compute-Optimal Large Language Models , author=. 2022 , eprint=

  8. [10]

    2022 , eprint=

    Scaling Laws for Reward Model Overoptimization , author=. 2022 , eprint=

  9. [12]

    The Eleventh International Conference on Learning Representations , year=

    Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions , author=. The Eleventh International Conference on Learning Representations , year=

  10. [13]

    Distilling Reasoning Capabilities into Smaller Language Models

    Shridhar, Kumar and Stolfo, Alessandro and Sachan, Mrinmaya. Distilling Reasoning Capabilities into Smaller Language Models. Findings of the Association for Computational Linguistics: ACL 2023. 2023

  11. [14]

    2022 , eprint=

    Solving math word problems with process- and outcome-based feedback , author=. 2022 , eprint=

  12. [15]

    2023 , eprint=

    RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment , author=. 2023 , eprint=

  13. [16]

    2023 , eprint=

    RRHF: Rank Responses to Align Language Models with Human Feedback without tears , author=. 2023 , eprint=

  14. [17]

    2022 , eprint=

    Large Language Models Can Self-Improve , author=. 2022 , eprint=

  15. [18]

    2022 , url=

    Eric Zelikman and Yuhuai Wu and Jesse Mu and Noah Goodman , booktitle=. 2022 , url=

  16. [19]

    Solving Math Word Problems via Cooperative Reasoning induced Language Models

    Zhu, Xinyu and Wang, Junjie and Zhang, Lin and Zhang, Yuxiang and Huang, Yongfeng and Gan, Ruyi and Zhang, Jiaxing and Yang, Yujiu. Solving Math Word Problems via Cooperative Reasoning induced Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

  17. [21]

    Making Language Models Better Reasoners with Step-Aware Verifier

    Li, Yifei and Lin, Zeqi and Zhang, Shizhuo and Fu, Qiang and Chen, Bei and Lou, Jian-Guang and Chen, Weizhu. Making Language Models Better Reasoners with Step-Aware Verifier. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

  18. [23]

    2021 , eprint=

    MWPToolkit: An Open-Source Framework for Deep Learning-Based Math Word Problem Solvers , author=. 2021 , eprint=

  19. [26]

    2021 , eprint=

    Show Your Work: Scratchpads for Intermediate Computation with Language Models , author=. 2021 , eprint=

  20. [27]

    Advances in Neural Information Processing Systems , editor=

    Large Language Models are Zero-Shot Reasoners , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

  21. [28]

    Hindsight Experience Replay , url =

    Andrychowicz, Marcin and Wolski, Filip and Ray, Alex and Schneider, Jonas and Fong, Rachel and Welinder, Peter and McGrew, Bob and Tobin, Josh and Pieter Abbeel, OpenAI and Zaremba, Wojciech , booktitle =. Hindsight Experience Replay , url =

  22. [29]

    2023 , howpublished =

    Ye, Seonghyeon and Jo, Yongrae and Kim, Doyoung and Kim, Sungdong and Hwang, Hyeonbin and Seo, Minjoon , title =. 2023 , howpublished =

  23. [30]

    2023 , eprint=

    The Wisdom of Hindsight Makes Language Models Better Instruction Followers , author=. 2023 , eprint=

  24. [31]

    2023 , eprint=

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing , author=. 2023 , eprint=

  25. [32]

    2022 , eprint=

    Constitutional AI: Harmlessness from AI Feedback , author=. 2022 , eprint=

  26. [33]

    2022 , eprint=

    Generating Sequences by Learning to Self-Correct , author=. 2022 , eprint=

  27. [34]

    Transactions on Machine Learning Research , issn=

    Emergent Abilities of Large Language Models , author=. Transactions on Machine Learning Research , issn=. 2022 , url=

  28. [35]

    2022 , eprint=

    PEER: A Collaborative Language Model , author=. 2022 , eprint=

  29. [36]

    2023 , eprint=

    Self-Refine: Iterative Refinement with Self-Feedback , author=. 2023 , eprint=

  30. [37]

    2022 , eprint=

    Self-critiquing models for assisting human evaluators , author=. 2022 , eprint=

  31. [40]

    2023 , eprint=

    LLaMA: Open and Efficient Foundation Language Models , author=. 2023 , eprint=

  32. [41]

    2023 , eprint=

    Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

  33. [42]

    2023 , eprint=

    GPT-4 Technical Report , author=. 2023 , eprint=

  34. [43]

    Training Trajectories of Language Models Across Scales

    Xia, Mengzhou and Artetxe, Mikel and Zhou, Chunting and Lin, Xi Victoria and Pasunuru, Ramakanth and Chen, Danqi and Zettlemoyer, Luke and Stoyanov, Veselin. Training Trajectories of Language Models Across Scales. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

  35. [44]

    ArXiv , year=

    Language Models are Few-Shot Learners , author=. ArXiv , year=

  36. [45]

    2023 , eprint=

    Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance , author=. 2023 , eprint=

  37. [47]

    2022 , eprint=

    PaLM: Scaling Language Modeling with Pathways , author=. 2022 , eprint=

  38. [49]

    2019 , eprint=

    Analysing Mathematical Reasoning Abilities of Neural Models , author=. 2019 , eprint=

  39. [50]

    2023 , eprint=

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models , author=. 2023 , eprint=

  40. [51]

    2022 , eprint=

    Solving Quantitative Reasoning Problems with Language Models , author=. 2022 , eprint=

  41. [53]

    InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities , author=

  42. [54]

    Alpaca-CoT: An Instruction-Tuning Platform with Unified Interface of Instruction Collection, Parameter-efficient Methods, and Large Language Models , year =

    Qingyi, Si and Tong, Wang and Naibin, Gu and Rui, Liu and Zheng, Lin , school =. Alpaca-CoT: An Instruction-Tuning Platform with Unified Interface of Instruction Collection, Parameter-efficient Methods, and Large Language Models , year =. GitHub repository , howpublished =

  43. [56]

    Wang, Ben and Komatsuzaki, Aran , title =

  44. [59]

    Hindsight experience replay

    Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curr...

  45. [60]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  46. [61]

    Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  47. [62]

    org/10.5281/zenodo.5297715

    Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow , March 2021. URL https://doi.org/10.5281/zenodo.5297715

  48. [63]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,...

  49. [64]

    2022 , publisher =

    Ethan Caballero, Kshitij Gupta, Irina Rish, and David Krueger. Broken neural scaling laws. arXiv preprint arXiv:2210.14891, 2022

  50. [65]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  51. [66]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  52. [67]

    Raft: Reward ranked finetuning for generative foundation model alignment, 2023

    Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment, 2023

  53. [68]

    Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance, 2023 a

    Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance, 2023 a

  54. [69]

    Specializing smaller language models towards multi-step reasoning

    Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023 b

  55. [70]

    Scaling laws for reward model overoptimization, 2022

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization, 2022

  56. [71]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020

  57. [72]

    Scaling laws for transfer, 2021

    Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer, 2021

  58. [73]

    Rae, Oriol Vinyals, and Laurent Sifre

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  59. [74]

    Large language models can self-improve, 2022

    Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve, 2022

  60. [75]

    Learning to reason deductively: Math word problem solving as complex relation extraction

    Zhanming Jie, Jierui Li, and Wei Lu. Learning to reason deductively: Math word problem solving as complex relation extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 5944--5955, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-lo...

  61. [76]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361

  62. [77]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=e2TBb5y0yFf

  63. [78]

    MAWPS : A math word problem repository

    Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS : A math word problem repository. In Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies , pp.\ 1152--1157, San Diego, California, June 2016. Association for Computational L...

  64. [79]

    Mwptoolkit: An open-source framework for deep learning-based math word problem solvers, 2021

    Yihuai Lan, Lei Wang, Qiyuan Zhang, Yunshi Lan, Bing Tian Dai, Yan Wang, Dongxiang Zhang, and Ee-Peng Lim. Mwptoolkit: An open-source framework for deep learning-based math word problem solvers, 2021

  65. [80]

    Solving quantitative reasoning problems with language models, 2022

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022

  66. [81]

    Making language models better reasoners with step-aware verifier

    Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 5315--5333, Toronto, Canada, July 2023. Association for Computational Linguistics....

  67. [82]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023

  68. [83]

    Scaling data-constrained language models

    Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models, 2023

  69. [84]

    Learning math reasoning from self-sampled correct and partially-correct solutions

    Ansong Ni, Jeevana Priya Inala, Chenglong Wang, Alex Polozov, Christopher Meek, Dragomir Radev, and Jianfeng Gao. Learning math reasoning from self-sampled correct and partially-correct solutions. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=4D4TSJE6-K

  70. [85]

    Show your work: Scratchpads for intermediate computation with language models

    Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models, 2021

  71. [86]

    GPT-4 technical report

    OpenAI. GPT-4 technical report, 2023

  72. [87]

    Are NLP models really able to solve simple math word problems?

    Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2080–2094, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2...

  73. [88]

    Alpaca-cot: An instruction-tuning platform with unified interface of instruction collection, parameter-efficient methods, and large language models

    Si Qingyi, Wang Tong, Gu Naibin, Liu Rui, and Lin Zheng. Alpaca-cot: An instruction-tuning platform with unified interface of instruction collection, parameter-efficient methods, and large language models. https://github.com/PhoebusSi/alpaca-CoT, 2023

  74. [89]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, pp. 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery. ...

  75. [90]

    Analysing mathematical reasoning abilities of neural models

    David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models, 2019

  76. [91]

    Distilling reasoning capabilities into smaller language models

    Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. Distilling reasoning capabilities into smaller language models. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 7059–7073, Toronto, Canada, July 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.findings-acl.441

  77. [92]

    Preference ranking optimization for human alignment

    Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. Preference ranking optimization for human alignment. arXiv preprint arXiv:2306.17492, 2023

  78. [93]

    Internlm: A multilingual language model with progressively enhanced capabilities

    InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023

  79. [94]

    Llama: Open and efficient foundation language models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023a

  80. [95]

    Llama 2: Open foundation and fine-tuned chat models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

Showing first 80 references.