pith. sign in

arxiv: 2309.12284 · v4 · submitted 2023-09-21 · 💻 cs.CL · cs.AI

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Pith reviewed 2026-05-13 10:00 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords mathematical reasoninglarge language modelsfine-tuningdata augmentationGSM8KMATHLLaMA-2question rewriting
0
0 comments X

The pith

Rewriting existing math questions from multiple perspectives lets fine-tuned LLaMA-2 models reach 66.4 percent on GSM8K.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that taking standard math problems and rewriting each one several times from fresh angles creates a more effective training set called MetaMathQA. Fine-tuning LLaMA-2 models on this set produces large accuracy jumps on GSM8K and MATH without adding any new external facts or problems. The 7B version hits 66.4 percent on GSM8K and 19.4 percent on MATH, beating earlier open-source models of the same size by double-digit margins. Even the 70B version slightly exceeds GPT-3.5-Turbo on GSM8K. The approach treats the bottleneck in mathematical reasoning as insufficient variety in how problems are presented rather than insufficient raw data volume.

Core claim

By rewriting each original mathematical question from multiple distinct perspectives without introducing external knowledge, the authors create the MetaMathQA dataset that, when used to fine-tune LLaMA-2, produces models with substantially stronger mathematical reasoning capabilities, as measured by accuracy on GSM8K and MATH benchmarks.

What carries the argument

The bootstrapping process of rewriting each question from multiple perspectives to generate diverse training examples in MetaMathQA.

Load-bearing premise

Rewriting questions from multiple perspectives produces sufficiently diverse, high-quality, and non-redundant examples that improve actual reasoning rather than merely increasing data volume.

What would settle it

Train two models on identical numbers of examples, one using the perspective-rewriting process and one using simple duplication or random rephrasing, then compare their GSM8K and MATH scores.

read the original abstract

Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (e.g., LLaMA-2) are still far away from satisfactory for solving mathematical problem due to the complex reasoning procedures. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Particularly, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release all the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MetaMath, a fine-tuning approach for LLaMA-2 models that first bootstraps a new dataset (MetaMathQA) by rewriting existing mathematical questions from multiple perspectives without introducing external knowledge, then trains on this augmented data. It reports large gains on GSM8K (66.4% for 7B, 82.3% for 70B) and MATH (19.4% for 7B), exceeding prior open-source models of comparable size by 11.5% and 8.7% respectively.

Significance. If the performance lift is shown to arise from the diversity and quality of the multi-perspective rewrites rather than from simply increasing training volume, the method supplies a low-cost, knowledge-free data-augmentation recipe that could be applied to other reasoning domains and would materially narrow the gap between open-source and closed-source mathematical reasoning models.

major comments (3)
  1. [Section 3] Section 3 (MetaMathQA Construction): the rewriting procedure is described only at a high level; the paper does not specify the exact prompts, the number of rewrites per seed question, or any automated filters for mathematical validity or non-redundancy, preventing independent reproduction of the claimed data quality.
  2. [Section 4.2] Section 4.2 and Table 2: no ablation holds total training tokens or example count fixed while varying the rewrite strategy (e.g., MetaMathQA vs. duplicated original GSM8K/MATH vs. random paraphrases). Without this control, the 11.5% GSM8K and 8.7% MATH gains cannot be attributed to multi-perspective rewriting rather than increased data volume.
  3. [Section 4.3] Section 4.3: the comparison tables report single-run accuracies without error bars, multiple random seeds, or statistical significance tests, which is especially problematic when claiming large margins over prior SOTA models of the same size.
minor comments (2)
  1. [Figure 1] Figure 1 caption and surrounding text use inconsistent terminology ('forward' vs. 'backward' rewriting) that is never formally defined.
  2. [Abstract] The abstract states that MetaMath-70B is 'slightly better than GPT-3.5-Turbo' on GSM8K, but the main text does not report the exact GPT-3.5-Turbo score used for this comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of reproducibility, experimental controls, and statistical reporting that will strengthen the manuscript. We address each major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (MetaMathQA Construction): the rewriting procedure is described only at a high level; the paper does not specify the exact prompts, the number of rewrites per seed question, or any automated filters for mathematical validity or non-redundancy, preventing independent reproduction of the claimed data quality.

    Authors: We agree that additional details are required for full reproducibility. In the revised manuscript we will add the exact prompts used for multi-perspective rewriting in an appendix. We generate four rewrites per seed question. For quality control we apply an automated filter that solves both the original and rewritten questions with a symbolic solver and discards any pair whose answers differ; we also remove near-duplicates via embedding similarity. These steps will be described in detail. revision: yes

  2. Referee: [Section 4.2] Section 4.2 and Table 2: no ablation holds total training tokens or example count fixed while varying the rewrite strategy (e.g., MetaMathQA vs. duplicated original GSM8K/MATH vs. random paraphrases). Without this control, the 11.5% GSM8K and 8.7% MATH gains cannot be attributed to multi-perspective rewriting rather than increased data volume.

    Authors: We acknowledge that a volume-controlled ablation is necessary to isolate the contribution of multi-perspective rewriting. We will add this experiment in the revised version: we train on (i) the original GSM8K/MATH data duplicated to match the example count of MetaMathQA, (ii) random paraphrases generated with the same model and prompt style but without the multi-perspective instruction, and (iii) MetaMathQA itself, keeping total training tokens fixed. Results will be reported in an updated Table 2. revision: yes

  3. Referee: [Section 4.3] Section 4.3: the comparison tables report single-run accuracies without error bars, multiple random seeds, or statistical significance tests, which is especially problematic when claiming large margins over prior SOTA models of the same size.

    Authors: We agree that reporting variance would increase confidence in the results. Due to compute limits we performed single runs for the 70B model; for the 7B model we will rerun with three random seeds and report mean and standard deviation. We will also add a brief discussion noting that the observed margins (11.5% and 8.7%) substantially exceed typical run-to-run variance observed in similar fine-tuning settings. These changes will appear in Section 4.3 and the tables. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical augmentation evaluated on external benchmarks

full rationale

The paper describes an empirical pipeline—rewriting existing math questions from multiple perspectives to create MetaMathQA, then fine-tuning LLaMA-2 models on the resulting dataset and measuring accuracy on the fixed external benchmarks GSM8K and MATH. No equations, fitted parameters, or self-referential quantities are presented as predictions. No load-bearing self-citations or uniqueness theorems are invoked. The central results are direct performance numbers on independent test sets after training, with no reduction of any claimed derivation to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work rests on standard supervised fine-tuning and data-augmentation assumptions.

pith-pipeline@v0.9.0 · 5570 in / 1136 out tokens · 48427 ms · 2026-05-13T10:00:51.493241+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 55 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FISolver trains a compact LLM on backward-generated (differential equation, first integral) pairs and uses guided reinforcement learning to outperform larger models and Mathematica on first-integral benchmarks at lower cost.

  2. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  3. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  4. Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

    cs.LG 2025-07 unverdicted novelty 7.0

    An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.

  5. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  6. FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning

    cs.LG 2026-05 unverdicted novelty 6.0

    FuRA uses block tensor-train factorization with fixed pretrained SVD basis to achieve full-rank spectral preconditioning, outperforming Full FT by +1.37 on LLaMA-3-8B commonsense reasoning and surpassing QLoRA in quan...

  7. Self-Supervised On-Policy Distillation for Reasoning Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIM...

  8. Generating Leakage-Free Benchmarks for Robust RAG Evaluation

    cs.CL 2026-05 unverdicted novelty 6.0

    SeedRG generates novel, leakage-free RAG benchmark examples from seed data by mapping reasoning structures and swapping entities while applying consistency and leakage checks.

  9. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

    cs.LG 2026-05 unverdicted novelty 6.0

    Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

  10. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 6.0

    RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.

  11. You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

    cs.CR 2026-05 unverdicted novelty 6.0

    NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...

  12. Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces

    cs.AI 2026-05 unverdicted novelty 6.0

    JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...

  13. Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

    cs.LG 2026-05 unverdicted novelty 6.0

    Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

  14. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  15. HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    A cooperative system with one SLM distilling stepwise hints from a large model to guide another SLM's math reasoning yields consistent accuracy gains on benchmarks.

  16. Sensitivity-Positional Co-Localization in GQA Transformers

    cs.CL 2026-04 unverdicted novelty 6.0

    In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU,...

  17. Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation

    cs.CL 2026-02 unverdicted novelty 6.0

    A modified divergence decouples top-K teacher probabilities from the distribution tail during distillation, yielding competitive performance on decoder models with standard compute.

  18. Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression

    cs.LG 2026-02 unverdicted novelty 6.0

    Extra-CoT trains a semantic compressor on math CoT data, applies mixed-ratio SFT, and uses CHRPO reinforcement learning to achieve over 73% token reduction on MATH-500 with 0.6% accuracy gain on Qwen3-1.7B.

  19. Multi-Token Prediction via Self-Distillation

    cs.CL 2026-02 unverdicted novelty 6.0

    Self-distillation turns pretrained autoregressive LMs into multi-token predictors that decode over 3x faster with under 5% accuracy drop on GSM8K.

  20. Vision-aligned Latent Reasoning for Multi-modal Large Language Model

    cs.CV 2026-02 unverdicted novelty 6.0

    VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

  21. Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

    cs.CL 2025-08 unverdicted novelty 6.0

    Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.

  22. InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling

    cs.CL 2025-08 unverdicted novelty 6.0

    InternBootcamp supplies 1000+ verifiable, auto-generated task environments across domains that enable task scaling to improve LLM reasoning, producing a 32B model with state-of-the-art results on the new Bootcamp-EVAL...

  23. League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

    cs.AI 2025-07 unverdicted novelty 6.0

    League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.

  24. MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation

    cs.LG 2025-06 conditional novelty 6.0

    MLorc compresses optimizer momentum with low-rank methods to enable memory-efficient full fine-tuning of LLMs, outperforming LoRA and GaLore while matching full-parameter performance at small ranks.

  25. FoNE: Precise Single-Token Number Embeddings via Fourier Features

    cs.CL 2025-02 unverdicted novelty 6.0

    FoNE encodes numbers as single tokens via Fourier features and outperforms subword and digit-wise embeddings on addition, subtraction, and multiplication with far less data.

  26. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  27. Scaling Synthetic Data Creation with 1,000,000,000 Personas

    cs.CL 2024-06 unverdicted novelty 6.0

    A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.

  28. Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

    cs.LG 2024-06 conditional novelty 6.0

    Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.

  29. Improve Mathematical Reasoning in Language Models by Automated Process Supervision

    cs.CL 2024-06 conditional novelty 6.0

    OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.

  30. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    cs.CL 2024-02 unverdicted novelty 6.0

    DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.

  31. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    cs.AI 2023-12 conditional novelty 6.0

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  32. Llemma: An Open Language Model For Mathematics

    cs.CL 2023-10 unverdicted novelty 6.0

    Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.

  33. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

    cs.CL 2023-09 conditional novelty 6.0

    MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.

  34. LoCO: Low-rank Compositional Rotation Fine-tuning

    cs.LG 2026-05 unverdicted novelty 5.0

    LoCO is a PEFT technique that constructs orthogonal transformations via low-rank skew-symmetric matrices and compositional rotation chains with a parallelizable approximation, validated on transformer adaptations.

  35. Strategic Over-Parameterization for Generalizable Low-Rank Adaptation

    cs.LG 2026-05 unverdicted novelty 5.0

    LoRA-Over injects auxiliary parameters into low-rank adapters during training and decomposes them back into standard LoRA at inference, with static or dynamic scheduling to allocate extra capacity where needed, yieldi...

  36. Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

    cs.LG 2026-05 unverdicted novelty 5.0

    Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

  37. Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

    cs.LG 2026-05 unverdicted novelty 5.0

    NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.

  38. NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

    cs.LG 2026-05 unverdicted novelty 5.0

    Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

  39. Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 5.0

    LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.

  40. Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 5.0

    LoPT delivers competitive LLM post-training results by training only the top half on the task objective and using feature reconstruction to update the bottom half.

  41. Post-Optimization Adaptive Rank Allocation for LoRA

    cs.AI 2026-04 unverdicted novelty 5.0

    PARA uses post-optimization SVD with a global singular-value threshold to allocate non-uniform ranks to LoRA layers, cutting parameters 75-90% with no loss in benchmark performance.

  42. NVIDIA Nemotron 3: Efficient and Open Intelligence

    cs.CL 2025-12 unverdicted novelty 5.0

    NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

  43. Hard Negative Sample-Augmented DPO Post-Training for Small Language Models

    cs.LG 2025-12 unverdicted novelty 5.0

    A six-dimensional MathVerifier supplies hard negatives and per-sample weights that improve DPO performance on math reasoning for a 1.5B Qwen2.5 model over standard SFT and unweighted DPO.

  44. Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

    cs.LG 2025-12 unverdicted novelty 5.0

    Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.

  45. BoHA: Blockwise Hadamard Product Adaptation for Parameter-Efficient Fine-Tuning

    cs.LG 2025-09 unverdicted novelty 5.0

    BoHA partitions frozen weights into a b by b grid and applies independent low-rank Hadamard factors per block, outperforming LoRA on matched-budget single-task averages while retaining 57.66% first-stage accuracy in a...

  46. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  47. Efficient Reasoning with Hidden Thinking

    cs.CL 2025-01 unverdicted novelty 5.0

    Heima compresses verbose CoT into hidden thinking tokens via information-theoretic analysis and an adaptive interpreter, claiming maintained or improved zero-shot accuracy on reasoning benchmarks.

  48. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  49. Training and Evaluating Language Models with Template-based Data Generation

    cs.CL 2024-11 unverdicted novelty 5.0

    TDG uses GPT-4 to generate meta-templates that synthesize over 7 million verifiable grade school math problems for training and aligning LLMs on reasoning tasks.

  50. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    cs.CV 2024-07 conditional novelty 5.0

    InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

  51. Rethinking Wireless Communications through Formal Mathematical AI Reasoning

    eess.SP 2026-04 unverdicted novelty 4.0

    Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.

  52. Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation

    cs.CL 2026-04 unverdicted novelty 4.0

    AMR uses difficulty-aware routing and uncertainty-guided aggregation across three experts plus a neural verifier to reach 75.28% accuracy on GSM8K without synthetic training data.

  53. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  54. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  55. Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

    cs.CL 2025-02 unverdicted novelty 2.0

    Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 52 Pith papers · 24 internal anchors

  1. [1]

    Alibaba. Qwen-7b. Technical Report, 2023

  2. [2]

    R. Anil, A. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. Clark, L. Shafey, Y . Huang, K. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y . Tay, K. Xiao, Y . Xu, Y . Zhang, G. Abrego, J. Ahn, J. Austin, P. Barham, J. Botha, J. Bradbury, S. Brahma, K. Brooks, M. Catast...

  3. [3]

    Azerbayev, H

    Z. Azerbayev, H. Schoelkopf, K. Paster, M. Dos, S. McAleer, A. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An Open Language Model For Mathematics. In International Conference on Learning Representations, 2024

  4. [4]

    Baichuan 2

    BaichuanInc. Baichuan 2. Technical Report, 2023

  5. [5]

    A is B” Fail to Learn “B is A

    L. Berglund, M. Tong, M. Kaufmann, M. Balesni, A. Stickland, T. Korbak, and O. Evans. The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A”. InInternational Conference on Learning Representations, 2024

  6. [6]

    J. Bilmes. Submodularity In Machine Learning and Artificial Intelligence. Preprint arXiv:2202.00132, 2022

  7. [7]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. L...

  8. [8]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-V oss, W. Gus...

  9. [9]

    W. Chen, X. Ma, X. Wang, and W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Preprint arXiv:2211.12588, 2022. 10 Published as a conference paper at ICLR 2024

  10. [10]

    Y . Chen, R. Zhong, S. Zha, G. Karypis, and H. He. Meta-learning via Language Model In-context Tuning. In Annual Meeting of the Association for Computational Linguistics, 2022

  11. [11]

    Chiang, Z

    W. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. Gonzalez, I. Stoica, and E. Xing. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality. Technical Report, 2023

  12. [12]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y . Tay, N. Shazeer, V . Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H....

  13. [13]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training Verifiers to Solve Math Word Problems. Preprint arXiv:2110.14168, 2021

  14. [14]

    Collins, Albert Q

    K. Collins, A. Jiang, S. Frieder, L. Wong, M. Zilka, U. Bhatt, T. Lukasiewicz, Y . Wu, J. Tenen- baum, W. Hart, T. Gowers, W. Li, A. Weller, and M. Jamnik. Evaluating Language Models for Mathematics through Interactions. Preprint arXiv:2306.01694, 2023

  15. [15]

    QLoRA: Efficient Finetuning of Quantized LLMs

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized llms. Preprint arXiv:2305.14314, 2023

  16. [16]

    Devlin, M

    J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics, 2019

  17. [17]

    D. Dua, Y . Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In North American Chapter of the Association for Computational Linguistics, 2019

  18. [18]

    TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

    R. Eldan and Y . Li. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? Preprint arXiv:2305.07759, 2023

  19. [19]

    Y . Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot. Specializing Smaller Language Models towards Multi-Step Reasoning. In International Conference on Machine Learning, 2023

  20. [20]

    Y . Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot. Complexity-Based Prompting for Multi- step Reasoning. In International Conference on Learning Representations, 2023

  21. [21]

    J. Gou, B. Yu, S. Maybank, and D. Tao. Knowledge Distillation: A Survey. International Journal of Computer Vision, 2021

  22. [22]

    T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y . Yan. Knowledge Adaptation for Efficient Semantic Segmentation. In Computer Vision and Pattern Recognition, 2019

  23. [23]

    Hendrycks, C

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. In Neural Information Processing Systems: Datasets and Benchmarks, 2021

  24. [24]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean. Distilling the Knowledge in a Neural Network. Preprint arXiv:1503.02531, 2015

  25. [25]

    N. Ho, L. Schmid, and S. Yun. Large Language Models Are Reasoning Teachers. In Annual Meeting of the Association for Computational Linguistics, 2023. 11 Published as a conference paper at ICLR 2024

  26. [26]

    Hsieh, C

    C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y . Fujii, A. Ratner, R. Krishna, C. Lee, and T. Pfister. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. In Annual Meeting of the Association for Computational Linguistics , 2023

  27. [27]

    Large Language Models Can Self-Improve

    J. Huang, S. Gu, L. Hou, Y . Wu, X. Wang, H. Yu, and J. Han. Large Language Models Can Self-Improve. Preprint arXiv:2210.11610, 2022

  28. [28]

    Imani, L

    S. Imani, L. Du, and H. Shrivastava. MathPrompter: Mathematical Reasoning using Large Language Models. In Annual Meeting of the Association for Computational Linguistics, 2023

  29. [29]

    InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities

    InternLM. InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities. Technical Report, 2023

  30. [30]

    Mistral 7B

    A. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. Chaplot, F. Bressand D. Casas, G. Lengyel, G. Lample, L. Saulnier, L. Lavaud, M. Lachaux, P. Stock, T. Scao, T. Lavril, T. Wang, and T. Lacroixand W. Sayed. Mistral 7B. Preprint arXiv:2310.06825, 2023

  31. [31]

    Jiang, B

    W. Jiang, B. Lin, H. Shi, Y . Zhang, Z. Li, and J. Kwok. BYOM: Building Your Own Multi-Task Model for Free. Preprint arXiv:2310.01886, 2023

  32. [32]

    Backward reasoning in large language models for verification

    W. Jiang, H. Shi, L. Yu, Z. Liu, Y . Zhang, Z. Li, and J. Kwok. Forward-Backward Reasoning in Large Language Models for Mathematical Verification. Preprint arXiv:2308.07758, 2023

  33. [33]

    Jiang, Y

    W. Jiang, Y . Zhang, and J. Kwok. Effective Structured-Prompting by Meta-Learning and Representitive Verbalizer. In International Conference on Machine Learning, 2023

  34. [34]

    Kilbertus, G

    N. Kilbertus, G. Parascandolo, and B. Sch¨olkopf. Generalization in anti-causal learning. Preprint arXiv:1812.00524, 2018

  35. [35]

    Lewkowycz, A

    A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y . Wu, B. Neyshabur, G. Gur-Ari, and V . Misra. Solving Quantitative Reasoning Problems with Language Models. In Neural Information Processing Systems, 2022

  36. [36]

    R. Li, L. Allal, Y . Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy- Poirier, J. Monteiro, O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M. Yee, L. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. Murthy, J. Stillerman, S. Patel, D. Abulkha...

  37. [37]

    S. Li, J. Chen, Y . Shen, Z. Chen, X. Zhang, Z. Li, H. Wang, J. Qian, B. Peng, Y . Mao, W. Chen, and X. Yan. Explanations from Large Language Models Make Small Reasoners Better. Preprint arXiv:2210.06726, 2022

  38. [38]

    X. Li, Z. Zhou, J. Zhu, J. Yao, T. Liu, and B. Han. DeepInception: Hypnotize Large Language Model to be Jailbreaker. Preprint arXiv:2311.03191, 2023

  39. [39]

    Lightman, V

    H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s Verify Step by Step. InInternational Conference on Learning Representations, 2024

  40. [40]

    W. Liu, B. Dai, A. Humayun, C. Tay, C. Yu, L. Smith, J. Rehg, and L. Song. Iterative Machine Teaching. In International Conference on Machine Learning, 2017

  41. [41]

    W. Liu, Z. Liu, H. Wang, L. Paull, B. Sch¨olkopf, and A. Weller. Iterative Teaching by Label Synthesis. In Neural Information Processing Systems, 2021. 12 Published as a conference paper at ICLR 2024

  42. [42]

    Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. Preprint arXiv:1907.11692, 2019

  43. [43]

    H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. Preprint arXiv:2308.09583, 2023

  44. [44]

    Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. In International Conference on Learning Representations, 2024

  45. [45]

    Magister, J

    L. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Severyn. Teaching Small Language Models to Reason. In Annual Meeting of the Association for Computational Linguistics, 2023

  46. [46]

    When less is more: Investigating data pruning for pretraining llms at scale.arXiv preprint arXiv:2309.04564,

    M. Marion, A. ¨Ust¨un, L. Pozzobon, A. Wang, M. Fadaee, and S. Hooker. When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale. Preprint arXiv:2309.04564, 2023

  47. [47]

    S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi. MetaICL: Learning to Learn In Context. In North American Chapter of the Association for Computational Linguistics, 2022

  48. [48]

    Mirzadeh, M

    S. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh. Improved Knowledge Distillation via Teacher Assistant. In AAAI Conference on Artificial Intelligence, 2020

  49. [49]

    Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs

    MosaicML. Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs. Technical Report, 2023

  50. [50]

    Platypus: Quick, cheap, and powerful refinement of llms.arXiv preprint arXiv:2308.07317,

    Ariel N., Cole J., and Nataniel R. Platypus: Quick, Cheap, and Powerful Refinement of LLMs. Preprint arXiv:2308.07317, 2023

  51. [51]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y . Zhou, S. Savarese, and C. Xiong. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. Preprint arXiv:2203.13474, 2022

  52. [52]

    OpenAI. GPT-3.5. Technical Report, 2022

  53. [53]

    GPT-3.5-Turbo

    OpenAI. GPT-3.5-Turbo. Technical Report, 2022

  54. [54]

    OpenAI. GPT-4. Technical Report, 2023

  55. [55]

    Ouyang, J

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training Language Models to Follow Instructions with Human Feedback. In Neural Information Processing Systems, 2022

  56. [56]

    W. Park, D. Kim, Y . Lu, and M. Cho. Relational Knowledge Distillation. InComputer Vision and Pattern Recognition, 2019

  57. [57]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. Preprint arXiv:2306.01116, 2023

  58. [58]

    Z. Qiu, W. Liu, T. Xiao, Z. Liu, U. Bhatt, Y . Luo, A. Weller, and B. Sch ¨olkopf. Iterative Teaching by Data Hallucination. In Artificial Intelligence and Statistics, 2023

  59. [59]

    Radford, J

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language Models are Unsupervised Multitask Learners. Technical Report, 2019

  60. [60]

    Raffel, N

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.Journal of Machine Learning Research, 2020. 13 Published as a conference paper at ICLR 2024

  61. [61]

    Code Llama: Open Foundation Models for Code

    B. Rozi`ere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. Tan, Y . Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. Ferrer, A. Grattafiori, W. Xiong, A. D´efossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve. Code Llama: Open Foundation Models for Code. Preprint arXiv:2308.12950, 2023

  62. [62]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. Preprint arXiv:1707.06347, 2017

  63. [63]

    P. Shen, X. Lu, S. Li, and H. Kawai. Feature Representation of Short Utterances Based on Knowledge Distillation for Spoken Language Identification. In International Speech Communi- cation Association, 2018

  64. [64]

    Shridhar, A

    K. Shridhar, A. Stolfo, and M. Sachan. Distilling Reasoning Capabilities into Smaller Language Models. In Findings of the Association for Computational Linguistics, 2023

  65. [65]

    J. Sun, C. Zheng, E. Xie, Z. Liu, R. Chu, J. Qiu, et al. A Survey of Reasoning with Foundation Models. Preprint arXiv:2312.11562, 2023

  66. [66]

    Talmor, J

    A. Talmor, J. Herzig, N. Lourie, and J. Berant. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In North American Chapter of the Association for Computational Linguistics, 2019

  67. [67]

    Taori, I

    R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. Hashimoto. Stanford Alpaca: An Instruction-following LLaMA Model. Technical report, 2023

  68. [68]

    Galactica: A Large Language Model for Science

    R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V . Kerkez, and R. Stojnic. Galactica: A Large Language Model for Science. Preprint arXiv:2211.09085, 2022

  69. [69]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozi`ere, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and Efficient Foundation Language Models. Preprint arXiv:2302.13971, 2023

  70. [70]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Ba- tra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Ferrer, M. Chen, G. Cucurull, D. Es- iobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V . Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V . Kerkez, M. Khabsa, I. Kloumann, A. Kore...

  71. [71]

    Wang and A

    B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. Technical Report, 2021

  72. [72]

    P. Wang, L. Li, L. Chen, F. Song, B. Lin, Y . Cao, T. Liu, and Z. Sui. Making Large Language Models Better Reasoners with Alignment. Preprint arXiv:2309.02144, 2023

  73. [73]

    T. Wang, J. Zhu, A. Torralba, and A. Efros. Dataset Distillation. Preprint arXiv:1811.10959, 2018

  74. [74]

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In International Conference on Learning Representations, 2023

  75. [75]

    J. Wei, X. Wang, D. Schuurmans, Maarten Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Neural Information Processing Systems, 2022

  76. [76]

    Y . Weng, M. Zhu, F. Xia, B. Li, S. He, K. Liu, and J. Zhao. Large Language Models are Better Reasoners with Self-Verification. In Conference on Empirical Methods in Natural Language Processing, 2023. 14 Published as a conference paper at ICLR 2024

  77. [77]

    H. Xin, H. Wang, C. Zheng, L. Li, Z. Liu, Q. Cao, Y . Huang, J. Xiong, H. Shi, E. Xie, J. Yin, Z. Li, H. Liao, and X. Liang. Lego-Prover: Neural theorem proving with growing libraries. In International Conference on Learning Representations, 2024

  78. [78]

    Xiong, Z

    J. Xiong, Z. Li, C. Zheng, Z. Guo, Y . Yin, E. Xie, Z. Yang, Q. Cao, H. Wang, X. Han, J. Tang, C. Li, and X. Liang. DQ-LoRE: Dual queries with low rank approximation re-ranking for in-context learning. In International Conference on Learning Representations, 2024

  79. [79]

    Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, and C. Zhou. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. Preprint arXiv:2308.01825, 2023

  80. [80]

    X. Yue, X. Qu, G. Zhang, Y . Fu, W. Huang, H. Sun, Y . Su, and W. Chen. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. In International Conference on Learning Representations, 2024

Showing first 80 references.