arxiv: 2308.01825 · v2 · submitted 2023-08-03 · 💻 cs.CL

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, Jingren Zhou

Pith reviewed 2026-05-15 00:16 UTC · model grok-4.3

classification 💻 cs.CL

keywords mathematical reasoning, large language models, scaling laws, rejection sampling, fine-tuning, GSM8K, pre-training loss, data augmentation

The pith

Pre-training loss predicts LLM mathematical reasoning performance better than parameter count, and rejection sampling fine-tuning lifts LLaMA-7B to 49.3 percent accuracy on GSM8K.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how pre-training loss, the quantity of supervised data, and the quantity of augmented data shape the mathematical reasoning ability of supervised large language models. It reports that pre-training loss correlates more strongly with final accuracy than model size does. Supervised data volume follows a log-linear relationship with performance, yet stronger models gain less from each additional example. The authors introduce rejection sampling fine-tuning to generate and retain only correct reasoning paths from the model itself, producing larger gains for weaker models and when paths are more diverse.

Core claim

The authors show that mathematical reasoning accuracy scales log-linearly with the volume of supervised fine-tuning data and that this scaling is steeper for models with higher pre-training loss. They further show that rejection sampling fine-tuning, which collects verified correct reasoning paths generated by the supervised models and uses them as additional training data, improves accuracy beyond standard supervised fine-tuning, with the largest gains occurring when samples from multiple models are pooled.
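A log-linear relation of this kind means accuracy is modeled as intercept + slope · ln(number of supervised examples). The following is a minimal sketch of how such a fit and its R² could be computed with ordinary least squares. The function name `fit_log_linear` and the data points are hypothetical; the points are synthetic placeholders generated from a known line, not results from the paper.

```python
import math
from typing import List, Tuple


def fit_log_linear(sizes: List[int], accuracies: List[float]) -> Tuple[float, float, float]:
    """Least-squares fit of accuracy = intercept + slope * ln(size).

    Returns (slope, intercept, r_squared).
    """
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 compares residual error against variance around the mean.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, accuracies))
    ss_tot = sum((y - mean_y) ** 2 for y in accuracies)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Synthetic placeholder data lying exactly on accuracy = 10 + 3 * ln(size).
sizes = [100, 1000, 10000]
accuracies = [10.0 + 3.0 * math.log(n) for n in sizes]
slope, intercept, r_squared = fit_log_linear(sizes, accuracies)
print(slope, intercept, r_squared)
```

Under this reading, "stronger models gain less from each additional example" would correspond to a smaller fitted slope for models with lower pre-training loss.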

What carries the argument

Rejection sampling fine-tuning (RFT), which generates candidate reasoning paths from the model, retains only those paths verified as correct, and fine-tunes on the retained set.
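The generate, filter, retain loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_paths` stands in for a call to a fine-tuned model, the `####` marker follows the GSM8K answer convention, and deduplicating on whitespace-normalized text is an assumed stand-in for whatever notion of "distinct reasoning path" the paper uses.

```python
import re
from typing import Callable, Dict, List, Optional


def extract_final_answer(path: str) -> Optional[float]:
    """Parse the number after a trailing '####' marker (GSM8K answer format)."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)\s*$", path.strip())
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def collect_rft_paths(
    question: str,
    gold_answer: float,
    sample_paths: Callable[[str, int], List[str]],  # hypothetical model sampler
    k: int,
) -> List[str]:
    """Sample k paths, keep those with the gold final answer, drop duplicates."""
    kept: Dict[str, str] = {}
    for path in sample_paths(question, k):
        answer = extract_final_answer(path)
        if answer is None or answer != gold_answer:
            continue  # reject: missing or wrong final answer
        key = " ".join(path.split())  # assumed dedup key: normalized text
        kept.setdefault(key, path)
    return list(kept.values())


def fake_sampler(question: str, k: int) -> List[str]:
    """Canned outputs standing in for model samples."""
    canned = [
        "3 + 4 = 7\n#### 7",
        "3 + 4 = 7\n####  7",   # duplicate up to whitespace
        "4 + 3 = 7\n#### 7",    # distinct path, correct answer
        "3 * 4 = 12\n#### 12",  # wrong answer, rejected
        "I am not sure.",       # no parseable answer, rejected
    ]
    return canned[:k]


augmented = collect_rft_paths("What is 3 + 4?", 7.0, fake_sampler, k=5)
print(len(augmented))  # 2
```

The retained paths would then be added to the supervised set for a further round of fine-tuning.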

Load-bearing premise

Model-generated reasoning paths whose final answer matches the ground truth can be treated as correct reasoning, without systematic false positives (right answer, flawed steps) passing the filter.

What would settle it

Replace the final-answer filter with random acceptance at the same rate and measure whether the reported accuracy gains on GSM8K disappear.
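One way to build such a control set is sketched below. The function name `random_acceptance_control` and the predicate `is_accepted` are hypothetical; the sketch matches the acceptance rate over the whole candidate pool, whereas a stricter control might match it per question.

```python
import random
from typing import Callable, List, Tuple


def random_acceptance_control(
    candidates: List[str],
    is_accepted: Callable[[str], bool],  # hypothetical acceptance filter
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return (filtered set, size-matched random control set)."""
    accepted = [c for c in candidates if is_accepted(c)]
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    # Draw the same number of paths uniformly, ignoring the filter.
    control = rng.sample(candidates, k=len(accepted))
    return accepted, control


# Toy candidates: the filter keeps paths that end in the gold answer "7".
candidates = ["a #### 7", "b #### 12", "c #### 7", "d #### 9"]
accepted, control = random_acceptance_control(
    candidates, lambda path: path.endswith("#### 7")
)
print(len(accepted), len(control))
```

Fine-tuning once on `accepted` and once on `control`, with everything else held fixed, would isolate the contribution of the filter from the contribution of simply adding more model-generated data.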

Original abstract

Mathematical reasoning is a challenging task for large language models (LLMs), while the scaling relationship of it with respect to LLM capacity is under-explored. In this paper, we investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM. We find that pre-training loss is a better indicator of the model's performance than the model's parameter count. We apply supervised fine-tuning (SFT) with different amounts of supervised data and empirically find a log-linear relation between data amount and model performance, and we find better models improve less with enlarged supervised datasets. To augment more data samples for improving model performances without any human effort, we propose to apply Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as augmented fine-tuning datasets. We find with augmented samples containing more distinct reasoning paths, RFT improves mathematical reasoning performance more for LLMs. We also find RFT brings more improvement for less performant LLMs. Furthermore, we combine rejection samples from multiple models which push LLaMA-7B to an accuracy of 49.3% on GSM8K which outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Pre-training loss beats parameter count for predicting math reasoning, and RFT lifts LLaMA-7B to 49.3% on GSM8K via exact-answer filtering.

The main things to know are that pre-training loss is a stronger predictor of downstream math performance than model size, and that the authors' rejection sampling fine-tuning (RFT) method produces a clear 13.4-point gain on GSM8K for LLaMA-7B. They generate many reasoning paths, keep only those whose final answer matches the ground truth, and fine-tune on the filtered set. Combining paths across several models adds enough diversity to reach 49.3% versus 35.9% from plain SFT. They also report a log-linear relationship between supervised data volume and accuracy, with stronger base models improving less from extra data.

These are straightforward empirical observations that line up with the numbers they show. The verification step uses exact final-answer matching rather than model judgment, which removes the usual concern about false positives in the filter. The trends are consistent across the experiments they describe, and the method itself is simple to implement.

The main soft spots are the single-benchmark focus on GSM8K and the lack of error bars or repeated runs, which makes it harder to gauge how stable the gains are. It would help to see the same patterns on MATH or another dataset. Still, the core claims rest on direct comparisons rather than any circular construction. This is the kind of paper that people working on LLM reasoning and data augmentation will want to read and test. It shows honest engagement with the scaling questions in this domain and deserves a serious referee.

Referee Report

2 major / 3 minor

Summary. The paper investigates scaling relationships for mathematical reasoning in LLMs. It reports that pre-training loss correlates more strongly with downstream performance than model parameter count, identifies a log-linear relationship between the volume of supervised fine-tuning data and accuracy on math benchmarks, and introduces Rejection sampling Fine-Tuning (RFT) that augments training sets with model-generated reasoning paths whose final answers match ground truth. Combining RFT samples across multiple models raises LLaMA-7B accuracy on GSM8K from 35.9% (SFT) to 49.3%.

Significance. If the empirical trends hold under broader controls, the work supplies practical, annotation-free methods for boosting LLM math reasoning and clarifies which pre-training metrics best forecast downstream capability. The demonstration that RFT yields larger gains for weaker base models and that distinct reasoning paths matter more than sheer volume is a concrete, reproducible contribution to efficient fine-tuning.

major comments (2)

[Section 4] Section 4 (RFT experiments): the headline 49.3% GSM8K accuracy for LLaMA-7B is reported without error bars, number of random seeds, or variance across runs; the 13.4-point gain over the 35.9% SFT baseline therefore cannot yet be assessed for statistical reliability.
[Section 3.2] Section 3.2 (scaling with supervised data): the claimed log-linear relation is shown visually but lacks the fitted slope, intercept, or R² value; without these statistics it is impossible to judge how well the functional form actually describes the observed points or whether the “better models improve less” interaction is significant.

minor comments (3)

[Abstract / Section 3.1] The abstract and Section 3.1 should explicitly state the exact pre-training loss metric (e.g., perplexity on which corpus) used to rank models, so readers can replicate the “better indicator than parameter count” comparison.
[Table 1] Table 1 or the corresponding results table should report the number of distinct reasoning paths retained after rejection sampling for each model size; this quantity is central to the claim that “augmented samples containing more distinct reasoning paths” drive the gains.
[Figures 2-4] Figure captions for the scaling plots should include the exact GSM8K test-set size and whether accuracy is computed with exact-match final-answer verification only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the valuable suggestions. We will address the major comments by providing additional statistical details in the revised manuscript.

Referee: [Section 4] Section 4 (RFT experiments): the headline 49.3% GSM8K accuracy for LLaMA-7B is reported without error bars, number of random seeds, or variance across runs; the 13.4-point gain over the 35.9% SFT baseline therefore cannot yet be assessed for statistical reliability.

Authors: We agree with the referee that reporting error bars and details on random seeds is necessary to assess statistical reliability. In the revised version, we will include results averaged over multiple random seeds (specifically, we will report means and standard deviations from 3 independent runs) and add error bars to the relevant figures and tables in Section 4. revision: yes

Referee: [Section 3.2] Section 3.2 (scaling with supervised data): the claimed log-linear relation is shown visually but lacks the fitted slope, intercept, or R² value; without these statistics it is impossible to judge how well the functional form actually describes the observed points or whether the “better models improve less” interaction is significant.

Authors: We appreciate this feedback. We will augment Section 3.2 with the fitted parameters (slope and intercept) and the R² value for the log-linear relationship. We will also include a statistical analysis of the interaction effect to evaluate the significance of the finding that better models improve less with additional supervised data. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper reports purely empirical results: measured pre-training loss as a predictor of downstream GSM8K accuracy, observed log-linear scaling of performance with supervised data volume, and accuracy gains from RFT (rejection sampling of paths whose final answer matches ground truth). All claims rest on direct experimental comparisons against SFT baselines and external benchmarks; no derivation, equation, or first-principles argument is offered that reduces to its own inputs by construction. No self-citation is used to justify a uniqueness theorem or ansatz, and the rejection filter relies on exact answer matching rather than model-generated verification. The work is therefore self-contained against external data and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is almost entirely empirical. The main unstated premise is that automatic verification of generated reasoning paths is accurate enough to serve as ground truth for further training.

axioms (1)

domain assumption: Model-generated reasoning paths can be filtered for correctness without systematic bias or false acceptance. The RFT procedure depends on this filter to produce usable augmented data.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
cs.CL 2026-05 unverdicted novelty 7.0

RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
cs.CL 2026-05 unverdicted novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
cs.CL 2026-05 unverdicted novelty 7.0

RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.
Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
cs.LG 2026-05 unverdicted novelty 7.0

Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...
Fine-Tuning Small Reasoning Models for Quantum Field Theory
cs.LG 2026-04 unverdicted novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
Step Rejection Fine-Tuning: A Practical Distillation Recipe
cs.LG 2026-05 unverdicted novelty 6.0

Step Rejection Fine-Tuning masks loss on erroneous steps identified by a critic LLM in unresolved trajectories, raising SWE-bench Verified resolution rate by 3.7% to 32.2% versus 2.4% for trajectory-level rejection.
CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators
cs.AI 2026-05 unverdicted novelty 6.0

CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
cs.CL 2026-05 unverdicted novelty 6.0

RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...
$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
cs.LG 2026-05 unverdicted novelty 6.0

S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.
Distillation Traps and Guards: A Calibration Knob for LLM Distillability
cs.LG 2026-04 unverdicted novelty 6.0

Reinforcement fine-tuning calibration makes LLM distillability adjustable, allowing optimized knowledge transfer or model IP safeguards via a combined task-KL-calibration objective.
Agentic Frameworks for Reasoning Tasks: An Empirical Study
cs.AI 2026-04 unverdicted novelty 6.0

An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
AIRA_2: Overcoming Bottlenecks in AI Research Agents
cs.AI 2026-03 conditional novelty 6.0

AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20...
SAM 3D: 3Dfy Anything in Images
cs.CV 2025-11 unverdicted novelty 6.0

SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.
Search-o1: Agentic Search-Enhanced Large Reasoning Models
cs.AI 2025-01 unverdicted novelty 6.0

Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...
StarCoder 2 and The Stack v2: The Next Generation
cs.SE 2024-02 accept novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
cs.AI 2023-12 conditional novelty 6.0

Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
cs.CL 2023-09 conditional novelty 6.0

Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
Language as a Latent Variable for Reasoning Optimization
cs.CL 2026-04 unverdicted novelty 5.0

Treating language as a latent variable via polyGRPO RL improves Qwen2.5-7B-Instruct by 6.72% on English reasoning benchmarks and 6.89% on multilingual ones, with cross-task gains on commonsense reasoning from math-onl...
H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models
cs.CL 2026-04 unverdicted novelty 5.0

H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.
PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor
cs.AI 2026-04 unverdicted novelty 5.0

PsychAgent combines memory-augmented planning, trajectory-based skill evolution, and rejection fine-tuning to create a self-improving AI psychological counselor that outperforms general LLMs in multi-session evaluations.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · cited by 20 Pith papers · 9 internal anchors

[2]

Emergent Abilities of Large Language Models , author=. Trans. Mach. Learn. Res. , year=

work page
[3]

ArXiv , year=

Finetuned Language Models Are Zero-Shot Learners , author=. ArXiv , year=

work page
[4]

ArXiv , year=

Chain of Thought Prompting Elicits Reasoning in Large Language Models , author=. ArXiv , year=

work page
[5]

The Eleventh International Conference on Learning Representations , year=

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. The Eleventh International Conference on Learning Representations , year=

work page
[6]

2023 , eprint=

Scaling Data-Constrained Language Models , author=. 2023 , eprint=

work page 2023
[8]

2021 , eprint=

Scaling Laws for Transfer , author=. 2021 , eprint=

work page 2021
[9]

2022 , eprint=

Training Compute-Optimal Large Language Models , author=. 2022 , eprint=

work page 2022
[10]

2022 , eprint=

Scaling Laws for Reward Model Overoptimization , author=. 2022 , eprint=

work page 2022
[12]

The Eleventh International Conference on Learning Representations , year=

Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions , author=. The Eleventh International Conference on Learning Representations , year=

work page
[13]

Distilling Reasoning Capabilities into Smaller Language Models

Shridhar, Kumar and Stolfo, Alessandro and Sachan, Mrinmaya. Distilling Reasoning Capabilities into Smaller Language Models. Findings of the Association for Computational Linguistics: ACL 2023. 2023

work page 2023
[14]

2022 , eprint=

Solving math word problems with process- and outcome-based feedback , author=. 2022 , eprint=

work page 2022
[15]

2023 , eprint=

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment , author=. 2023 , eprint=

work page 2023
[16]

2023 , eprint=

RRHF: Rank Responses to Align Language Models with Human Feedback without tears , author=. 2023 , eprint=

work page 2023
[17]

2022 , eprint=

Large Language Models Can Self-Improve , author=. 2022 , eprint=

work page 2022
[18]

2022 , url=

Eric Zelikman and Yuhuai Wu and Jesse Mu and Noah Goodman , booktitle=. 2022 , url=

work page 2022
[19]

Solving Math Word Problems via Cooperative Reasoning induced Language Models

Zhu, Xinyu and Wang, Junjie and Zhang, Lin and Zhang, Yuxiang and Huang, Yongfeng and Gan, Ruyi and Zhang, Jiaxing and Yang, Yujiu. Solving Math Word Problems via Cooperative Reasoning induced Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

work page 2023
[21]

Making Language Models Better Reasoners with Step-Aware Verifier

Li, Yifei and Lin, Zeqi and Zhang, Shizhuo and Fu, Qiang and Chen, Bei and Lou, Jian-Guang and Chen, Weizhu. Making Language Models Better Reasoners with Step-Aware Verifier. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

work page 2023
[23]

2021 , eprint=

MWPToolkit: An Open-Source Framework for Deep Learning-Based Math Word Problem Solvers , author=. 2021 , eprint=

work page 2021
[26]

2021 , eprint=

Show Your Work: Scratchpads for Intermediate Computation with Language Models , author=. 2021 , eprint=

work page 2021
[27]

Advances in Neural Information Processing Systems , editor=

Large Language Models are Zero-Shot Reasoners , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

work page 2022
[28]

Hindsight Experience Replay , url =

Andrychowicz, Marcin and Wolski, Filip and Ray, Alex and Schneider, Jonas and Fong, Rachel and Welinder, Peter and McGrew, Bob and Tobin, Josh and Pieter Abbeel, OpenAI and Zaremba, Wojciech , booktitle =. Hindsight Experience Replay , url =

work page
[29]

2023 , howpublished =

Ye, Seonghyeon and Jo, Yongrae and Kim, Doyoung and Kim, Sungdong and Hwang, Hyeonbin and Seo, Minjoon , title =. 2023 , howpublished =

work page 2023
[30]

2023 , eprint=

The Wisdom of Hindsight Makes Language Models Better Instruction Followers , author=. 2023 , eprint=

work page 2023
[31]

2023 , eprint=

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing , author=. 2023 , eprint=

work page 2023
[32]

2022 , eprint=

Constitutional AI: Harmlessness from AI Feedback , author=. 2022 , eprint=

work page 2022
[33]

2022 , eprint=

Generating Sequences by Learning to Self-Correct , author=. 2022 , eprint=

work page 2022
[34]

Transactions on Machine Learning Research , issn=

Emergent Abilities of Large Language Models , author=. Transactions on Machine Learning Research , issn=. 2022 , url=

work page 2022
[35]

2022 , eprint=

PEER: A Collaborative Language Model , author=. 2022 , eprint=

work page 2022
[36]

2023 , eprint=

Self-Refine: Iterative Refinement with Self-Feedback , author=. 2023 , eprint=

work page 2023
[37]

2022 , eprint=

Self-critiquing models for assisting human evaluators , author=. 2022 , eprint=

work page 2022
[40]

2023 , eprint=

LLaMA: Open and Efficient Foundation Language Models , author=. 2023 , eprint=

work page 2023
[41]

2023 , eprint=

Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

work page 2023
[42]

2023 , eprint=

GPT-4 Technical Report , author=. 2023 , eprint=

work page 2023
[43]

Training Trajectories of Language Models Across Scales

Xia, Mengzhou and Artetxe, Mikel and Zhou, Chunting and Lin, Xi Victoria and Pasunuru, Ramakanth and Chen, Danqi and Zettlemoyer, Luke and Stoyanov, Veselin. Training Trajectories of Language Models Across Scales. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

work page 2023
[44]

ArXiv , year=

Language Models are Few-Shot Learners , author=. ArXiv , year=

work page
[45]

2023 , eprint=

Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance , author=. 2023 , eprint=

work page 2023
[47]

2022 , eprint=

PaLM: Scaling Language Modeling with Pathways , author=. 2022 , eprint=

work page 2022
[49]

2019 , eprint=

Analysing Mathematical Reasoning Abilities of Neural Models , author=. 2019 , eprint=

work page 2019
[50]

2023 , eprint=

Tree of Thoughts: Deliberate Problem Solving with Large Language Models , author=. 2023 , eprint=

work page 2023
[51]

2022 , eprint=

Solving Quantitative Reasoning Problems with Language Models , author=. 2022 , eprint=

work page 2022
[53]

InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities , author=

work page
[54]

Alpaca-CoT: An Instruction-Tuning Platform with Unified Interface of Instruction Collection, Parameter-efficient Methods, and Large Language Models , year =

Qingyi, Si and Tong, Wang and Naibin, Gu and Rui, Liu and Zheng, Lin , school =. Alpaca-CoT: An Instruction-Tuning Platform with Unified Interface of Instruction Collection, Parameter-efficient Methods, and Large Language Models , year =. GitHub repository , howpublished =

work page
[56]

Wang, Ben and Komatsuzaki, Aran , title =

work page
[59]

Hindsight experience replay

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curr...

work page 2017
[60]

PaLM 2 Technical Report

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

work page 2022
[62]

org/10.5281/zenodo.5297715

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow , March 2021. URL https://doi.org/10.5281/zenodo.5297715

work page doi:10.5281/zenodo.5297715 2021
[63]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,...

work page internal anchor Pith review Pith/arXiv arXiv 2005
[64]

2022 , publisher =

Ethan Caballero, Kshitij Gupta, Irina Rish, and David Krueger. Broken neural scaling laws. arXiv preprint arXiv:2210.14891, 2022

work page arXiv 2022
[65]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

work page 2022
[66]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[67]

Raft: Reward ranked finetuning for generative foundation model alignment, 2023

Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment, 2023

work page 2023
[68]

Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance, 2023 a

Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance, 2023 a

work page 2023
[69]

Specializing smaller language models towards multi-step reasoning

Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023 b

work page arXiv 2023
[70]

Scaling laws for reward model overoptimization, 2022

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization, 2022

work page 2022
[71]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[72]

Scaling laws for transfer, 2021

Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer, 2021

work page 2021
[73]

Rae, Oriol Vinyals, and Laurent Sifre

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

work page 2022
[74]

Large language models can self-improve, 2022

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve, 2022

work page 2022
[75]

Learning to reason deductively: Math word problem solving as complex relation extraction

Zhanming Jie, Jierui Li, and Wei Lu. Learning to reason deductively: Math word problem solving as complex relation extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 5944--5955, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-lo...

work page doi:10.18653/v1/2022.acl-long.410 2022
[76]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2001
[77]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=e2TBb5y0yFf

[78]

MAWPS: A math word problem repository

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1152–1157, San Diego, California, June 2016. Association for Computational L...

doi:10.18653/v1/n16-1136
[79]

MWPToolkit: An open-source framework for deep learning-based math word problem solvers, 2021

Yihuai Lan, Lei Wang, Qiyuan Zhang, Yunshi Lan, Bing Tian Dai, Yan Wang, Dongxiang Zhang, and Ee-Peng Lim. MWPToolkit: An open-source framework for deep learning-based math word problem solvers, 2021

[80]

Solving quantitative reasoning problems with language models, 2022

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022

[81]

Making language models better reasoners with step-aware verifier

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5315–5333, Toronto, Canada, July 2023. Association for Computational Linguistics....

[82]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023

[83]

Scaling data-constrained language models, 2023

Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models, 2023

[84]

Learning math reasoning from self-sampled correct and partially-correct solutions

Ansong Ni, Jeevana Priya Inala, Chenglong Wang, Alex Polozov, Christopher Meek, Dragomir Radev, and Jianfeng Gao. Learning math reasoning from self-sampled correct and partially-correct solutions. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=4D4TSJE6-K

[85]

Show your work: Scratchpads for intermediate computation with language models, 2021

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models, 2021

[86]

GPT-4 technical report, 2023

OpenAI. GPT-4 technical report, 2023

[87]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2080–2094, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2...

doi:10.18653/v1/2021.naacl-main.168
[88]

Alpaca-CoT: An instruction-tuning platform with unified interface of instruction collection, parameter-efficient methods, and large language models

Si Qingyi, Wang Tong, Gu Naibin, Liu Rui, and Lin Zheng. Alpaca-CoT: An instruction-tuning platform with unified interface of instruction collection, parameter-efficient methods, and large language models. https://github.com/PhoebusSi/alpaca-CoT, 2023

[89]

DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, pp. 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery. ...

doi:10.1145/3394486.3406703
[90]

Analysing mathematical reasoning abilities of neural models, 2019

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models, 2019

[91]

Distilling reasoning capabilities into smaller language models

Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. Distilling reasoning capabilities into smaller language models. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 7059–7073, Toronto, Canada, July 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.findings-acl.441

[92]

Preference ranking optimization for human alignment

Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. Preference ranking optimization for human alignment. arXiv preprint arXiv:2306.17492, 2023

[93]

InternLM: A multilingual language model with progressively enhanced capabilities

InternLM Team. InternLM: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023

[94]

LLaMA: Open and efficient foundation language models, 2023a

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023a

[95]

Llama 2: Open foundation and fine-tuned chat models, 2023b

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

