MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Adrian Weller; Han Shi; James T. Kwok; Jincheng Yu; Longhui Yu; Weisen Jiang; Weiyang Liu; Yu Zhang; Zhenguo Li; Zhengying Liu

arxiv: 2309.12284 · v4 · submitted 2023-09-21 · 💻 cs.CL · cs.AI

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu

Pith reviewed 2026-05-13 10:00 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords mathematical reasoning · large language models · fine-tuning · data augmentation · GSM8K · MATH · LLaMA-2 · question rewriting

The pith

Rewriting existing math questions from multiple perspectives lets a fine-tuned LLaMA-2 7B model reach 66.4 percent on GSM8K.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that taking standard math problems and rewriting each one several times from fresh angles creates a more effective training set called MetaMathQA. Fine-tuning LLaMA-2 models on this set produces large accuracy jumps on GSM8K and MATH without adding any new external facts or problems. The 7B version reaches 66.4 percent on GSM8K and 19.4 percent on MATH, beating earlier open-source models of the same size by 11.5 and 8.7 points respectively. The 70B version reaches 82.3 percent on GSM8K, slightly above GPT-3.5-Turbo. The approach treats the bottleneck in mathematical reasoning as insufficient variety in how problems are presented rather than insufficient raw data volume.

Core claim

By rewriting each original mathematical question from multiple distinct perspectives without introducing external knowledge, the authors create the MetaMathQA dataset that, when used to fine-tune LLaMA-2, produces models with substantially stronger mathematical reasoning capabilities, as measured by accuracy on GSM8K and MATH benchmarks.

What carries the argument

The bootstrapping process of rewriting each question from multiple perspectives to generate diverse training examples in MetaMathQA.

Load-bearing premise

Rewriting questions from multiple perspectives produces sufficiently diverse, high-quality, and non-redundant examples that improve actual reasoning rather than merely increasing data volume.

What would settle it

Train two models on identical numbers of examples, one using the perspective-rewriting process and one using simple duplication or random rephrasing, then compare their GSM8K and MATH scores.
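The comparison proposed above depends on building training sets of equal size before any model is trained. The following is a minimal sketch of that step only, assuming the rewrites and paraphrases have already been generated; the function name `build_matched_sets` and its inputs are hypothetical and not taken from the paper. It matches example counts, not token counts, so a token-matched control would need an extra step.

```python
import random

def build_matched_sets(seed_questions, rewrites, paraphrases, rng_seed=0):
    """Return three training sets that contain the same number of examples.

    seed_questions: original questions (the duplication arm reuses these)
    rewrites:       perspective-rewritten questions (hypothetical input)
    paraphrases:    plain random paraphrases (hypothetical input)
    """
    rng = random.Random(rng_seed)
    # Use the smaller augmented pool as the common example count.
    target = min(len(rewrites), len(paraphrases))
    # Duplication arm: cycle through the originals until the count is reached.
    duplicated = [seed_questions[i % len(seed_questions)] for i in range(target)]
    # Augmented arms: subsample without replacement to the same count.
    paraphrased = rng.sample(paraphrases, target)
    rewritten = rng.sample(rewrites, target)
    return {
        "duplicated": duplicated,
        "paraphrased": paraphrased,
        "rewritten": rewritten,
    }
```

Each of the three returned lists would then be used to fine-tune the same base model with identical hyperparameters, so that any difference in GSM8K and MATH scores can be traced to the rewrite strategy.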

Original abstract

Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (e.g., LLaMA-2) are still far away from satisfactory for solving mathematical problem due to the complex reasoning procedures. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Particularly, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release all the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MetaMath gets real benchmark lifts on GSM8K and MATH by rewriting questions into MetaMathQA and fine-tuning LLaMA-2, but the gains could be mostly from extra data volume rather than the rewriting trick itself.

The main thing here is that they bootstrap a new dataset called MetaMathQA by rewriting existing math questions from several angles without adding outside knowledge, then fine-tune LLaMA-2 models on it. The 7B version reaches 66.4% on GSM8K and 19.4% on MATH, beating other open models of the same size by 11.5 and 8.7 points. The 70B model hits 82.3% on GSM8K, which is a bit above GPT-3.5-Turbo. They release the full dataset, the models, and the training code, which is the most immediately useful part of the work.

Referee Report

3 major / 2 minor

Summary. The paper proposes MetaMath, a fine-tuning approach for LLaMA-2 models that first bootstraps a new dataset (MetaMathQA) by rewriting existing mathematical questions from multiple perspectives without introducing external knowledge, then trains on this augmented data. It reports large gains on GSM8K (66.4% for 7B, 82.3% for 70B) and MATH (19.4% for 7B), with the 7B model exceeding prior open-source models of the same size by 11.5 and 8.7 percentage points respectively.

Significance. If the performance lift is shown to arise from the diversity and quality of the multi-perspective rewrites rather than from simply increasing training volume, the method supplies a low-cost, knowledge-free data-augmentation recipe that could be applied to other reasoning domains and would materially narrow the gap between open-source and closed-source mathematical reasoning models.

major comments (3)

Section 3 (MetaMathQA Construction): the rewriting procedure is described only at a high level; the paper does not specify the exact prompts, the number of rewrites per seed question, or any automated filters for mathematical validity or non-redundancy, preventing independent reproduction of the claimed data quality.
Section 4.2 and Table 2: no ablation holds total training tokens or example count fixed while varying the rewrite strategy (e.g., MetaMathQA vs. duplicated original GSM8K/MATH vs. random paraphrases). Without this control, the 11.5% GSM8K and 8.7% MATH gains cannot be attributed to multi-perspective rewriting rather than increased data volume.
Section 4.3: the comparison tables report single-run accuracies without error bars, multiple random seeds, or statistical significance tests, which is especially problematic when claiming large margins over prior SOTA models of the same size.
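To give the error-bar comment a sense of scale, the sampling noise of a single accuracy figure can be estimated from the test-set size alone. The sketch below uses the normal approximation to a binomial proportion. The test-set size of 1,319 problems for GSM8K is an assumption brought in from outside this page, and the function name `accuracy_interval` is hypothetical. It captures only test-set sampling noise, not run-to-run training variance, which still requires multiple seeds.

```python
import math

def accuracy_interval(accuracy, n_items, z=1.96):
    """Normal-approximation confidence interval for a benchmark accuracy.

    accuracy: observed fraction correct, between 0 and 1
    n_items:  number of test problems
    z:        critical value (1.96 corresponds to roughly 95% coverage)
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_items)
    return accuracy - z * standard_error, accuracy + z * standard_error

# Assumption: 1,319 test problems for GSM8K (not stated in the text above).
low, high = accuracy_interval(0.664, 1319)
print(f"GSM8K 7B interval: {low:.3f} to {high:.3f}")
```

Under that assumption the half-width is roughly 2.5 percentage points, which is small relative to the reported 11.5-point margin but says nothing about variance across training seeds.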

minor comments (2)

Figure 1: the caption and surrounding text use inconsistent terminology ('forward' vs. 'backward' rewriting) that is never formally defined.
Abstract: the abstract states that MetaMath-70B is 'slightly better than GPT-3.5-Turbo' on GSM8K, but the main text does not report the exact GPT-3.5-Turbo score used for this comparison.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of reproducibility, experimental controls, and statistical reporting that will strengthen the manuscript. We address each major comment below and will revise the paper accordingly.

Referee: Section 3 (MetaMathQA Construction): the rewriting procedure is described only at a high level; the paper does not specify the exact prompts, the number of rewrites per seed question, or any automated filters for mathematical validity or non-redundancy, preventing independent reproduction of the claimed data quality.

Authors: We agree that additional details are required for full reproducibility. In the revised manuscript we will add the exact prompts used for multi-perspective rewriting in an appendix. We generate four rewrites per seed question. For quality control we apply an automated filter that solves both the original and rewritten questions with a symbolic solver and discards any pair whose answers differ; we also remove near-duplicates via embedding similarity. These steps will be described in detail.
revision: yes

Referee: Section 4.2 and Table 2: no ablation holds total training tokens or example count fixed while varying the rewrite strategy (e.g., MetaMathQA vs. duplicated original GSM8K/MATH vs. random paraphrases). Without this control, the 11.5% GSM8K and 8.7% MATH gains cannot be attributed to multi-perspective rewriting rather than increased data volume.

Authors: We acknowledge that a volume-controlled ablation is necessary to isolate the contribution of multi-perspective rewriting. We will add this experiment in the revised version: we train on (i) the original GSM8K/MATH data duplicated to match the example count of MetaMathQA, (ii) random paraphrases generated with the same model and prompt style but without the multi-perspective instruction, and (iii) MetaMathQA itself, keeping total training tokens fixed. Results will be reported in an updated Table 2.
revision: yes

Referee: Section 4.3: the comparison tables report single-run accuracies without error bars, multiple random seeds, or statistical significance tests, which is especially problematic when claiming large margins over prior SOTA models of the same size.

Authors: We agree that reporting variance would increase confidence in the results. Due to compute limits we performed single runs for the 70B model; for the 7B model we will rerun with three random seeds and report mean and standard deviation. We will also add a brief discussion noting that the observed margins (11.5% and 8.7%) substantially exceed typical run-to-run variance observed in similar fine-tuning settings. These changes will appear in Section 4.3 and the tables.
revision: partial
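The quality-control step described in the first simulated response (discard rewrites whose solved answer differs from the original, then drop near-duplicates by embedding similarity) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `solve` stands in for whatever solver is used, `toy_embed` is a bag-of-words placeholder for a learned embedding model, and the 0.9 threshold is arbitrary.

```python
import math
from collections import Counter

def toy_embed(text):
    """Placeholder embedding: bag-of-words counts (a real pipeline would use a learned encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filter_rewrites(original_answer, candidates, solve, threshold=0.9):
    """Keep rewrites that preserve the answer and are not near-duplicates.

    candidates: list of rewritten question strings
    solve:      hypothetical callable mapping a question string to its answer
    threshold:  similarity at or above which a rewrite counts as a duplicate
    """
    kept, kept_vectors = [], []
    for question in candidates:
        if solve(question) != original_answer:
            continue  # the rewrite changed the answer, so discard it
        vector = toy_embed(question)
        if any(cosine(vector, other) >= threshold for other in kept_vectors):
            continue  # too similar to a rewrite that was already kept
        kept.append(question)
        kept_vectors.append(vector)
    return kept
```

The greedy order matters here: the first of two near-identical rewrites is kept and the later one is dropped, so shuffling the candidates can change which variant survives.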

Circularity Check

0 steps flagged

No circularity: empirical augmentation evaluated on external benchmarks

The paper describes an empirical pipeline—rewriting existing math questions from multiple perspectives to create MetaMathQA, then fine-tuning LLaMA-2 models on the resulting dataset and measuring accuracy on the fixed external benchmarks GSM8K and MATH. No equations, fitted parameters, or self-referential quantities are presented as predictions. No load-bearing self-citations or uniqueness theorems are invoked. The central results are direct performance numbers on independent test sets after training, with no reduction of any claimed derivation to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work rests on standard supervised fine-tuning and data-augmentation assumptions.

pith-pipeline@v0.9.0 · 5570 in / 1136 out tokens · 48427 ms · 2026-05-13T10:00:51.493241+00:00 · methodology

Forward citations

Cited by 55 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

FISolver trains a compact LLM on backward-generated (differential equation, first integral) pairs and uses guided reinforcement learning to outperform larger models and Mathematica on first-integral benchmarks at lower cost.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
cs.LG 2025-07 unverdicted novelty 7.0

An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning
cs.LG 2026-05 unverdicted novelty 6.0

FuRA uses block tensor-train factorization with fixed pretrained SVD basis to achieve full-rank spectral preconditioning, outperforming Full FT by +1.37 on LLaMA-3-8B commonsense reasoning and surpassing QLoRA in quan...
Self-Supervised On-Policy Distillation for Reasoning Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIM...
Generating Leakage-Free Benchmarks for Robust RAG Evaluation
cs.CL 2026-05 unverdicted novelty 6.0

SeedRG generates novel, leakage-free RAG benchmark examples from seed data by mapping reasoning structures and swapping entities while applying consistency and leakage checks.
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
cs.LG 2026-05 unverdicted novelty 6.0

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 6.0

RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
cs.CR 2026-05 unverdicted novelty 6.0

NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...
Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
cs.AI 2026-05 unverdicted novelty 6.0

JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
cs.LG 2026-05 unverdicted novelty 6.0

Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models
cs.AI 2026-04 unverdicted novelty 6.0

A cooperative system with one SLM distilling stepwise hints from a large model to guide another SLM's math reasoning yields consistent accuracy gains on benchmarks.
Sensitivity-Positional Co-Localization in GQA Transformers
cs.CL 2026-04 unverdicted novelty 6.0

In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU,...
Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation
cs.CL 2026-02 unverdicted novelty 6.0

A modified divergence decouples top-K teacher probabilities from the distribution tail during distillation, yielding competitive performance on decoder models with standard compute.
Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression
cs.LG 2026-02 unverdicted novelty 6.0

Extra-CoT trains a semantic compressor on math CoT data, applies mixed-ratio SFT, and uses CHRPO reinforcement learning to achieve over 73% token reduction on MATH-500 with 0.6% accuracy gain on Qwen3-1.7B.
Multi-Token Prediction via Self-Distillation
cs.CL 2026-02 unverdicted novelty 6.0

Self-distillation turns pretrained autoregressive LMs into multi-token predictors that decode over 3x faster with under 5% accuracy drop on GSM8K.
Vision-aligned Latent Reasoning for Multi-modal Large Language Model
cs.CV 2026-02 unverdicted novelty 6.0

VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
cs.CL 2025-08 unverdicted novelty 6.0

Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.
InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling
cs.CL 2025-08 unverdicted novelty 6.0

InternBootcamp supplies 1000+ verifiable, auto-generated task environments across domains that enable task scaling to improve LLM reasoning, producing a 32B model with state-of-the-art results on the new Bootcamp-EVAL...
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
cs.AI 2025-07 unverdicted novelty 6.0

League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation
cs.LG 2025-06 conditional novelty 6.0

MLorc compresses optimizer momentum with low-rank methods to enable memory-efficient full fine-tuning of LLMs, outperforming LoRA and GaLore while matching full-parameter performance at small ranks.
FoNE: Precise Single-Token Number Embeddings via Fourier Features
cs.CL 2025-02 unverdicted novelty 6.0

FoNE encodes numbers as single tokens via Fourier features and outperforms subword and digit-wise embeddings on addition, subtraction, and multiplication with far less data.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Scaling Synthetic Data Creation with 1,000,000,000 Personas
cs.CL 2024-06 unverdicted novelty 6.0

A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
cs.LG 2024-06 conditional novelty 6.0

Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.
Improve Mathematical Reasoning in Language Models by Automated Process Supervision
cs.CL 2024-06 conditional novelty 6.0

OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
cs.CL 2024-02 unverdicted novelty 6.0

DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
cs.AI 2023-12 conditional novelty 6.0

Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
Llemma: An Open Language Model For Mathematics
cs.CL 2023-10 unverdicted novelty 6.0

Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
cs.CL 2023-09 conditional novelty 6.0

MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
LoCO: Low-rank Compositional Rotation Fine-tuning
cs.LG 2026-05 unverdicted novelty 5.0

LoCO is a PEFT technique that constructs orthogonal transformations via low-rank skew-symmetric matrices and compositional rotation chains with a parallelizable approximation, validated on transformer adaptations.
Strategic Over-Parameterization for Generalizable Low-Rank Adaptation
cs.LG 2026-05 unverdicted novelty 5.0

LoRA-Over injects auxiliary parameters into low-rank adapters during training and decomposes them back into standard LoRA at inference, with static or dynamic scheduling to allocate extra capacity where needed, yieldi...
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
cs.LG 2026-05 unverdicted novelty 5.0

Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing
cs.LG 2026-05 unverdicted novelty 5.0

NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.
NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning
cs.LG 2026-05 unverdicted novelty 5.0

Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training
cs.CL 2026-05 unverdicted novelty 5.0

LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.
Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training
cs.CL 2026-05 unverdicted novelty 5.0

LoPT delivers competitive LLM post-training results by training only the top half on the task objective and using feature reconstruction to update the bottom half.
Post-Optimization Adaptive Rank Allocation for LoRA
cs.AI 2026-04 unverdicted novelty 5.0

PARA uses post-optimization SVD with a global singular-value threshold to allocate non-uniform ranks to LoRA layers, cutting parameters 75-90% with no loss in benchmark performance.
NVIDIA Nemotron 3: Efficient and Open Intelligence
cs.CL 2025-12 unverdicted novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
Hard Negative Sample-Augmented DPO Post-Training for Small Language Models
cs.LG 2025-12 unverdicted novelty 5.0

A six-dimensional MathVerifier supplies hard negatives and per-sample weights that improve DPO performance on math reasoning for a 1.5B Qwen2.5 model over standard SFT and unweighted DPO.
Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning
cs.LG 2025-12 unverdicted novelty 5.0

Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.
BoHA: Blockwise Hadamard Product Adaptation for Parameter-Efficient Fine-Tuning
cs.LG 2025-09 unverdicted novelty 5.0

BoHA partitions frozen weights into a b by b grid and applies independent low-rank Hadamard factors per block, outperforming LoRA on matched-budget single-task averages while retaining 57.66% first-stage accuracy in a...
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
cs.CL 2025-02 unverdicted novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
Efficient Reasoning with Hidden Thinking
cs.CL 2025-01 unverdicted novelty 5.0

Heima compresses verbose CoT into hidden thinking tokens via information-theoretic analysis and an adaptive interpreter, claiming maintained or improved zero-shot accuracy on reasoning benchmarks.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
cs.CV 2024-12 accept novelty 5.0

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
Training and Evaluating Language Models with Template-based Data Generation
cs.CL 2024-11 unverdicted novelty 5.0

TDG uses GPT-4 to generate meta-templates that synthesize over 7 million verifiable grade school math problems for training and aligning LLMs on reasoning tasks.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
cs.CV 2024-07 conditional novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
Rethinking Wireless Communications through Formal Mathematical AI Reasoning
eess.SP 2026-04 unverdicted novelty 4.0

Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.
Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation
cs.CL 2026-04 unverdicted novelty 4.0

AMR uses difficulty-aware routing and uncertainty-guided aggregation across three experts plus a neural verifier to reach 75.28% accuracy on GSM8K without synthetic training data.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
cs.CL 2024-01 unverdicted novelty 4.0

DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
cs.CL 2025-02 unverdicted novelty 2.0

Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 52 Pith papers · 24 internal anchors

[1]

Alibaba. Qwen-7b. Technical Report, 2023

work page 2023
[2]

R. Anil, A. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. Clark, L. Shafey, Y . Huang, K. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y . Tay, K. Xiao, Y . Xu, Y . Zhang, G. Abrego, J. Ahn, J. Austin, P. Barham, J. Botha, J. Bradbury, S. Brahma, K. Brooks, M. Catast...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Azerbayev, H

Z. Azerbayev, H. Schoelkopf, K. Paster, M. Dos, S. McAleer, A. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An Open Language Model For Mathematics. In International Conference on Learning Representations, 2024

work page 2024
[4]

Baichuan 2

BaichuanInc. Baichuan 2. Technical Report, 2023

work page 2023
[5]

A is B” Fail to Learn “B is A

L. Berglund, M. Tong, M. Kaufmann, M. Balesni, A. Stickland, T. Korbak, and O. Evans. The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A”. InInternational Conference on Learning Representations, 2024

work page 2024
[6]

J. Bilmes. Submodularity In Machine Learning and Artificial Intelligence. Preprint arXiv:2202.00132, 2022

work page arXiv 2022
[7]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. L...

work page 2020
[8]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-V oss, W. Gus...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

W. Chen, X. Ma, X. Wang, and W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Preprint arXiv:2211.12588, 2022. 10 Published as a conference paper at ICLR 2024

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Y . Chen, R. Zhong, S. Zha, G. Karypis, and H. He. Meta-learning via Language Model In-context Tuning. In Annual Meeting of the Association for Computational Linguistics, 2022

work page 2022
[11]

Chiang, Z

W. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. Gonzalez, I. Stoica, and E. Xing. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality. Technical Report, 2023

work page 2023
[12]

PaLM: Scaling Language Modeling with Pathways

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y . Tay, N. Shazeer, V . Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H....

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training Verifiers to Solve Math Word Problems. Preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Collins, Albert Q

K. Collins, A. Jiang, S. Frieder, L. Wong, M. Zilka, U. Bhatt, T. Lukasiewicz, Y . Wu, J. Tenen- baum, W. Hart, T. Gowers, W. Li, A. Weller, and M. Jamnik. Evaluating Language Models for Mathematics through Interactions. Preprint arXiv:2306.01694, 2023

[15]

T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. Preprint arXiv:2305.14314, 2023

[16]

J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics, 2019

[17]

D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In North American Chapter of the Association for Computational Linguistics, 2019

[18]

R. Eldan and Y. Li. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? Preprint arXiv:2305.07759, 2023

[19]

Y. Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot. Specializing Smaller Language Models towards Multi-Step Reasoning. In International Conference on Machine Learning, 2023

[20]

Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot. Complexity-Based Prompting for Multi-step Reasoning. In International Conference on Learning Representations, 2023

[21]

J. Gou, B. Yu, S. Maybank, and D. Tao. Knowledge Distillation: A Survey. International Journal of Computer Vision, 2021

[22]

T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y. Yan. Knowledge Adaptation for Efficient Semantic Segmentation. In Computer Vision and Pattern Recognition, 2019

[23]

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. In Neural Information Processing Systems: Datasets and Benchmarks, 2021

[24]

G. Hinton, O. Vinyals, and J. Dean. Distilling the Knowledge in a Neural Network. Preprint arXiv:1503.02531, 2015

[25]

N. Ho, L. Schmid, and S. Yun. Large Language Models Are Reasoning Teachers. In Annual Meeting of the Association for Computational Linguistics, 2023

[26]

C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C. Lee, and T. Pfister. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. In Annual Meeting of the Association for Computational Linguistics, 2023

[27]

J. Huang, S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and J. Han. Large Language Models Can Self-Improve. Preprint arXiv:2210.11610, 2022

[28]

S. Imani, L. Du, and H. Shrivastava. MathPrompter: Mathematical Reasoning using Large Language Models. In Annual Meeting of the Association for Computational Linguistics, 2023

[29]

InternLM. InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities. Technical Report, 2023

[30]

A. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. Chaplot, F. Bressand, D. Casas, G. Lengyel, G. Lample, L. Saulnier, L. Lavaud, M. Lachaux, P. Stock, T. Scao, T. Lavril, T. Wang, T. Lacroix, and W. Sayed. Mistral 7B. Preprint arXiv:2310.06825, 2023

[31]

W. Jiang, B. Lin, H. Shi, Y. Zhang, Z. Li, and J. Kwok. BYOM: Building Your Own Multi-Task Model for Free. Preprint arXiv:2310.01886, 2023

[32]

W. Jiang, H. Shi, L. Yu, Z. Liu, Y. Zhang, Z. Li, and J. Kwok. Forward-Backward Reasoning in Large Language Models for Mathematical Verification. Preprint arXiv:2308.07758, 2023

[33]

W. Jiang, Y. Zhang, and J. Kwok. Effective Structured-Prompting by Meta-Learning and Representative Verbalizer. In International Conference on Machine Learning, 2023

[34]

N. Kilbertus, G. Parascandolo, and B. Schölkopf. Generalization in Anti-Causal Learning. Preprint arXiv:1812.00524, 2018

[35]

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra. Solving Quantitative Reasoning Problems with Language Models. In Neural Information Processing Systems, 2022

[36]

R. Li, L. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier, J. Monteiro, O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M. Yee, L. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. Murthy, J. Stillerman, S. Patel, D. Abulkha...

[37]

S. Li, J. Chen, Y. Shen, Z. Chen, X. Zhang, Z. Li, H. Wang, J. Qian, B. Peng, Y. Mao, W. Chen, and X. Yan. Explanations from Large Language Models Make Small Reasoners Better. Preprint arXiv:2210.06726, 2022

[38]

X. Li, Z. Zhou, J. Zhu, J. Yao, T. Liu, and B. Han. DeepInception: Hypnotize Large Language Model to be Jailbreaker. Preprint arXiv:2311.03191, 2023

[39]

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s Verify Step by Step. In International Conference on Learning Representations, 2024

[40]

W. Liu, B. Dai, A. Humayun, C. Tay, C. Yu, L. Smith, J. Rehg, and L. Song. Iterative Machine Teaching. In International Conference on Machine Learning, 2017

[41]

W. Liu, Z. Liu, H. Wang, L. Paull, B. Schölkopf, and A. Weller. Iterative Teaching by Label Synthesis. In Neural Information Processing Systems, 2021

[42]

Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. Preprint arXiv:1907.11692, 2019

[43]

H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. Preprint arXiv:2308.09583, 2023

[44]

Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. In International Conference on Learning Representations, 2024

[45]

L. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Severyn. Teaching Small Language Models to Reason. In Annual Meeting of the Association for Computational Linguistics, 2023

[46]

M. Marion, A. Üstün, L. Pozzobon, A. Wang, M. Fadaee, and S. Hooker. When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale. Preprint arXiv:2309.04564, 2023

[47]

S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi. MetaICL: Learning to Learn In Context. In North American Chapter of the Association for Computational Linguistics, 2022

[48]

S. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh. Improved Knowledge Distillation via Teacher Assistant. In AAAI Conference on Artificial Intelligence, 2020

[49]

MosaicML. Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs. Technical Report, 2023

[50]

A. Lee, C. Hunter, and N. Ruiz. Platypus: Quick, Cheap, and Powerful Refinement of LLMs. Preprint arXiv:2308.07317, 2023

[51]

E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. Preprint arXiv:2203.13474, 2022

[52]

OpenAI. GPT-3.5. Technical Report, 2022

[53]

OpenAI. GPT-3.5-Turbo. Technical Report, 2022

[54]

OpenAI. GPT-4. Technical Report, 2023

[55]

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training Language Models to Follow Instructions with Human Feedback. In Neural Information Processing Systems, 2022

[56]

W. Park, D. Kim, Y. Lu, and M. Cho. Relational Knowledge Distillation. In Computer Vision and Pattern Recognition, 2019

[57]

G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. Preprint arXiv:2306.01116, 2023

[58]

Z. Qiu, W. Liu, T. Xiao, Z. Liu, U. Bhatt, Y. Luo, A. Weller, and B. Schölkopf. Iterative Teaching by Data Hallucination. In Artificial Intelligence and Statistics, 2023

[59]

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language Models are Unsupervised Multitask Learners. Technical Report, 2019

[60]

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 2020

[61]

B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve. Code Llama: Open Foundation Models for Code. Preprint arXiv:2308.12950, 2023

[62]

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. Preprint arXiv:1707.06347, 2017

[63]

P. Shen, X. Lu, S. Li, and H. Kawai. Feature Representation of Short Utterances Based on Knowledge Distillation for Spoken Language Identification. In International Speech Communication Association, 2018

[64]

K. Shridhar, A. Stolfo, and M. Sachan. Distilling Reasoning Capabilities into Smaller Language Models. In Findings of the Association for Computational Linguistics, 2023

[65]

J. Sun, C. Zheng, E. Xie, Z. Liu, R. Chu, J. Qiu, et al. A Survey of Reasoning with Foundation Models. Preprint arXiv:2312.11562, 2023

[66]

A. Talmor, J. Herzig, N. Lourie, and J. Berant. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In North American Chapter of the Association for Computational Linguistics, 2019

[67]

R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. Hashimoto. Stanford Alpaca: An Instruction-following LLaMA Model. Technical Report, 2023

[68]

R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic. Galactica: A Large Language Model for Science. Preprint arXiv:2211.09085, 2022

[69]

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and Efficient Foundation Language Models. Preprint arXiv:2302.13971, 2023

[70]

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Kore...

[71]

B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. Technical Report, 2021

[72]

P. Wang, L. Li, L. Chen, F. Song, B. Lin, Y. Cao, T. Liu, and Z. Sui. Making Large Language Models Better Reasoners with Alignment. Preprint arXiv:2309.02144, 2023

[73]

T. Wang, J. Zhu, A. Torralba, and A. Efros. Dataset Distillation. Preprint arXiv:1811.10959, 2018

[74]

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In International Conference on Learning Representations, 2023

[75]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Neural Information Processing Systems, 2022

[76]

Y. Weng, M. Zhu, F. Xia, B. Li, S. He, K. Liu, and J. Zhao. Large Language Models are Better Reasoners with Self-Verification. In Conference on Empirical Methods in Natural Language Processing, 2023

[77]

H. Xin, H. Wang, C. Zheng, L. Li, Z. Liu, Q. Cao, Y. Huang, J. Xiong, H. Shi, E. Xie, J. Yin, Z. Li, H. Liao, and X. Liang. Lego-Prover: Neural Theorem Proving with Growing Libraries. In International Conference on Learning Representations, 2024

[78]

J. Xiong, Z. Li, C. Zheng, Z. Guo, Y. Yin, E. Xie, Z. Yang, Q. Cao, H. Wang, X. Han, J. Tang, C. Li, and X. Liang. DQ-LoRE: Dual Queries with Low Rank Approximation Re-ranking for In-context Learning. In International Conference on Learning Representations, 2024

[79]

Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, and C. Zhou. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. Preprint arXiv:2308.01825, 2023

[80]

X. Yue, X. Qu, G. Zhang, Y. Fu, W. Huang, H. Sun, Y. Su, and W. Chen. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. In International Conference on Learning Representations, 2024


[1] [1]

Alibaba. Qwen-7b. Technical Report, 2023

work page 2023

[2] [2]

R. Anil, A. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. Clark, L. Shafey, Y . Huang, K. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y . Tay, K. Xiao, Y . Xu, Y . Zhang, G. Abrego, J. Ahn, J. Austin, P. Barham, J. Botha, J. Bradbury, S. Brahma, K. Brooks, M. Catast...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Azerbayev, H

Z. Azerbayev, H. Schoelkopf, K. Paster, M. Dos, S. McAleer, A. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An Open Language Model For Mathematics. In International Conference on Learning Representations, 2024

work page 2024

[4] [4]

Baichuan 2

BaichuanInc. Baichuan 2. Technical Report, 2023

work page 2023

[5] [5]

A is B” Fail to Learn “B is A

L. Berglund, M. Tong, M. Kaufmann, M. Balesni, A. Stickland, T. Korbak, and O. Evans. The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A”. InInternational Conference on Learning Representations, 2024

work page 2024

[6] [6]

J. Bilmes. Submodularity In Machine Learning and Artificial Intelligence. Preprint arXiv:2202.00132, 2022

work page arXiv 2022

[7] [7]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. L...

work page 2020

[8] [8]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-V oss, W. Gus...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[9] [9]

W. Chen, X. Ma, X. Wang, and W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Preprint arXiv:2211.12588, 2022. 10 Published as a conference paper at ICLR 2024

work page internal anchor Pith review Pith/arXiv arXiv 2022

[10] [10]

Y . Chen, R. Zhong, S. Zha, G. Karypis, and H. He. Meta-learning via Language Model In-context Tuning. In Annual Meeting of the Association for Computational Linguistics, 2022

work page 2022

[11] [11]

Chiang, Z

W. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. Gonzalez, I. Stoica, and E. Xing. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality. Technical Report, 2023

work page 2023

[12] [12]

PaLM: Scaling Language Modeling with Pathways

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y . Tay, N. Shazeer, V . Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H....

work page internal anchor Pith review Pith/arXiv arXiv 2022

[13] [13]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training Verifiers to Solve Math Word Problems. Preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[14] [14]

Collins, Albert Q

K. Collins, A. Jiang, S. Frieder, L. Wong, M. Zilka, U. Bhatt, T. Lukasiewicz, Y . Wu, J. Tenen- baum, W. Hart, T. Gowers, W. Li, A. Weller, and M. Jamnik. Evaluating Language Models for Mathematics through Interactions. Preprint arXiv:2306.01694, 2023

work page arXiv 2023

[15] [15]

QLoRA: Efficient Finetuning of Quantized LLMs

T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized llms. Preprint arXiv:2305.14314, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Devlin, M

J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics, 2019

work page 2019

[17] [17]

D. Dua, Y . Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In North American Chapter of the Association for Computational Linguistics, 2019

work page 2019

[18] [18]

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

R. Eldan and Y . Li. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? Preprint arXiv:2305.07759, 2023

work page internal anchor Pith review arXiv 2023

[19] [19]

Y . Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot. Specializing Smaller Language Models towards Multi-Step Reasoning. In International Conference on Machine Learning, 2023

work page 2023

[20] [20]

Y . Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot. Complexity-Based Prompting for Multi- step Reasoning. In International Conference on Learning Representations, 2023

work page 2023

[21] [21]

J. Gou, B. Yu, S. Maybank, and D. Tao. Knowledge Distillation: A Survey. International Journal of Computer Vision, 2021

work page 2021

[22] [22]

T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y . Yan. Knowledge Adaptation for Efficient Semantic Segmentation. In Computer Vision and Pattern Recognition, 2019

work page 2019

[23] [23]

Hendrycks, C

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. In Neural Information Processing Systems: Datasets and Benchmarks, 2021

work page 2021

[24] [24]

Distilling the Knowledge in a Neural Network

G. Hinton, O. Vinyals, and J. Dean. Distilling the Knowledge in a Neural Network. Preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[25] [25]

N. Ho, L. Schmid, and S. Yun. Large Language Models Are Reasoning Teachers. In Annual Meeting of the Association for Computational Linguistics, 2023. 11 Published as a conference paper at ICLR 2024

work page 2023

[26] [26]

Hsieh, C

C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y . Fujii, A. Ratner, R. Krishna, C. Lee, and T. Pfister. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. In Annual Meeting of the Association for Computational Linguistics , 2023

work page 2023

[27] [27]

Large Language Models Can Self-Improve

J. Huang, S. Gu, L. Hou, Y . Wu, X. Wang, H. Yu, and J. Han. Large Language Models Can Self-Improve. Preprint arXiv:2210.11610, 2022

work page internal anchor Pith review arXiv 2022

[28] [28]

Imani, L

S. Imani, L. Du, and H. Shrivastava. MathPrompter: Mathematical Reasoning using Large Language Models. In Annual Meeting of the Association for Computational Linguistics, 2023

work page 2023

[29] [29]

InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities

InternLM. InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities. Technical Report, 2023

work page 2023

[30] [30]

Mistral 7B

A. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. Chaplot, F. Bressand D. Casas, G. Lengyel, G. Lample, L. Saulnier, L. Lavaud, M. Lachaux, P. Stock, T. Scao, T. Lavril, T. Wang, and T. Lacroixand W. Sayed. Mistral 7B. Preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[31] [31]

Jiang, B

W. Jiang, B. Lin, H. Shi, Y . Zhang, Z. Li, and J. Kwok. BYOM: Building Your Own Multi-Task Model for Free. Preprint arXiv:2310.01886, 2023

work page arXiv 2023

[32] [32]

Backward reasoning in large language models for verification

W. Jiang, H. Shi, L. Yu, Z. Liu, Y . Zhang, Z. Li, and J. Kwok. Forward-Backward Reasoning in Large Language Models for Mathematical Verification. Preprint arXiv:2308.07758, 2023

work page arXiv 2023

[33] [33]

Jiang, Y

W. Jiang, Y . Zhang, and J. Kwok. Effective Structured-Prompting by Meta-Learning and Representitive Verbalizer. In International Conference on Machine Learning, 2023

work page 2023

[34] [34]

Kilbertus, G

N. Kilbertus, G. Parascandolo, and B. Sch¨olkopf. Generalization in anti-causal learning. Preprint arXiv:1812.00524, 2018

work page arXiv 2018

[35] [35]

Lewkowycz, A

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y . Wu, B. Neyshabur, G. Gur-Ari, and V . Misra. Solving Quantitative Reasoning Problems with Language Models. In Neural Information Processing Systems, 2022

work page 2022

[36] [36]

R. Li, L. Allal, Y . Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy- Poirier, J. Monteiro, O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M. Yee, L. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. Murthy, J. Stillerman, S. Patel, D. Abulkha...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

S. Li, J. Chen, Y . Shen, Z. Chen, X. Zhang, Z. Li, H. Wang, J. Qian, B. Peng, Y . Mao, W. Chen, and X. Yan. Explanations from Large Language Models Make Small Reasoners Better. Preprint arXiv:2210.06726, 2022

work page arXiv 2022

[38] [38]

X. Li, Z. Zhou, J. Zhu, J. Yao, T. Liu, and B. Han. DeepInception: Hypnotize Large Language Model to be Jailbreaker. Preprint arXiv:2311.03191, 2023

work page internal anchor Pith review arXiv 2023

[39] [39]

Lightman, V

H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s Verify Step by Step. InInternational Conference on Learning Representations, 2024

work page 2024

[40] [40]

W. Liu, B. Dai, A. Humayun, C. Tay, C. Yu, L. Smith, J. Rehg, and L. Song. Iterative Machine Teaching. In International Conference on Machine Learning, 2017

work page 2017

[41] [41]

W. Liu, Z. Liu, H. Wang, L. Paull, B. Sch¨olkopf, and A. Weller. Iterative Teaching by Label Synthesis. In Neural Information Processing Systems, 2021. 12 Published as a conference paper at ICLR 2024

work page 2021

[42] [42]

Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. Preprint arXiv:1907.11692, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907

[43] [43]

H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. Preprint arXiv:2308.09583, 2023

work page internal anchor Pith review arXiv 2023

[44] [44]

Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. In International Conference on Learning Representations, 2024

work page 2024

[45] [45]

Magister, J

L. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Severyn. Teaching Small Language Models to Reason. In Annual Meeting of the Association for Computational Linguistics, 2023

work page 2023

[46] [46]

When less is more: Investigating data pruning for pretraining llms at scale.arXiv preprint arXiv:2309.04564,

M. Marion, A. ¨Ust¨un, L. Pozzobon, A. Wang, M. Fadaee, and S. Hooker. When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale. Preprint arXiv:2309.04564, 2023

work page arXiv 2023

[47] [47]

S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi. MetaICL: Learning to Learn In Context. In North American Chapter of the Association for Computational Linguistics, 2022

work page 2022

[48] [48]

Mirzadeh, M

S. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh. Improved Knowledge Distillation via Teacher Assistant. In AAAI Conference on Artificial Intelligence, 2020

work page 2020

[49] [49]

Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs

MosaicML. Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs. Technical Report, 2023

work page 2023

[50] [50]

Platypus: Quick, cheap, and powerful refinement of llms.arXiv preprint arXiv:2308.07317,

Ariel N., Cole J., and Nataniel R. Platypus: Quick, Cheap, and Powerful Refinement of LLMs. Preprint arXiv:2308.07317, 2023

work page arXiv 2023

[51] [51]

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y . Zhou, S. Savarese, and C. Xiong. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. Preprint arXiv:2203.13474, 2022

work page internal anchor Pith review arXiv 2022

[52] [52]

OpenAI. GPT-3.5. Technical Report, 2022

work page 2022

[53] [53]

GPT-3.5-Turbo

OpenAI. GPT-3.5-Turbo. Technical Report, 2022

work page 2022

[54] [54]

OpenAI. GPT-4. Technical Report, 2023

work page 2023

[55] [55]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training Language Models to Follow Instructions with Human Feedback. In Neural Information Processing Systems, 2022

work page 2022

[56] [56]

W. Park, D. Kim, Y . Lu, and M. Cho. Relational Knowledge Distillation. InComputer Vision and Pattern Recognition, 2019

work page 2019

[57] [57]

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. Preprint arXiv:2306.01116, 2023

work page internal anchor Pith review arXiv 2023

[58] [58]

Z. Qiu, W. Liu, T. Xiao, Z. Liu, U. Bhatt, Y . Luo, A. Weller, and B. Sch ¨olkopf. Iterative Teaching by Data Hallucination. In Artificial Intelligence and Statistics, 2023

work page 2023

[59] [59]

Radford, J

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language Models are Unsupervised Multitask Learners. Technical Report, 2019

work page 2019

[60] [60]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.Journal of Machine Learning Research, 2020. 13 Published as a conference paper at ICLR 2024

work page 2020

[61] [61]

Code Llama: Open Foundation Models for Code

B. Rozi`ere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. Tan, Y . Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. Ferrer, A. Grattafiori, W. Xiong, A. D´efossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve. Code Llama: Open Foundation Models for Code. Preprint arXiv:2308.12950, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[62] [62]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. Preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[63] [63]

P. Shen, X. Lu, S. Li, and H. Kawai. Feature Representation of Short Utterances Based on Knowledge Distillation for Spoken Language Identification. In International Speech Communi- cation Association, 2018

work page 2018

[64] K. Shridhar, A. Stolfo, and M. Sachan. Distilling Reasoning Capabilities into Smaller Language Models. In Findings of the Association for Computational Linguistics, 2023

[65] J. Sun, C. Zheng, E. Xie, Z. Liu, R. Chu, J. Qiu, et al. A Survey of Reasoning with Foundation Models. Preprint arXiv:2312.11562, 2023

[66] A. Talmor, J. Herzig, N. Lourie, and J. Berant. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In North American Chapter of the Association for Computational Linguistics, 2019

[67] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. Hashimoto. Stanford Alpaca: An Instruction-following LLaMA Model. Technical report, 2023

[68] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic. Galactica: A Large Language Model for Science. Preprint arXiv:2211.09085, 2022

[69] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and Efficient Foundation Language Models. Preprint arXiv:2302.13971, 2023

[70] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Kore... Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint, 2023

[71] B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. Technical Report, 2021

[72] P. Wang, L. Li, L. Chen, F. Song, B. Lin, Y. Cao, T. Liu, and Z. Sui. Making Large Language Models Better Reasoners with Alignment. Preprint arXiv:2309.02144, 2023

[73] T. Wang, J. Zhu, A. Torralba, and A. Efros. Dataset Distillation. Preprint arXiv:1811.10959, 2018

[74] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In International Conference on Learning Representations, 2023

[75] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Neural Information Processing Systems, 2022

[76] Y. Weng, M. Zhu, F. Xia, B. Li, S. He, K. Liu, and J. Zhao. Large Language Models are Better Reasoners with Self-Verification. In Conference on Empirical Methods in Natural Language Processing, 2023

[77] H. Xin, H. Wang, C. Zheng, L. Li, Z. Liu, Q. Cao, Y. Huang, J. Xiong, H. Shi, E. Xie, J. Yin, Z. Li, H. Liao, and X. Liang. Lego-Prover: Neural Theorem Proving with Growing Libraries. In International Conference on Learning Representations, 2024

[78] J. Xiong, Z. Li, C. Zheng, Z. Guo, Y. Yin, E. Xie, Z. Yang, Q. Cao, H. Wang, X. Han, J. Tang, C. Li, and X. Liang. DQ-LoRE: Dual Queries with Low Rank Approximation Re-ranking for In-context Learning. In International Conference on Learning Representations, 2024

[79] Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, and C. Zhou. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. Preprint arXiv:2308.01825, 2023

[80] X. Yue, X. Qu, G. Zhang, Y. Fu, W. Huang, H. Sun, Y. Su, and W. Chen. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. In International Conference on Learning Representations, 2024