LIMO: Less is More for Reasoning
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-17 02:06 UTC · model grok-4.3
The pith
Sophisticated mathematical reasoning emerges in large language models from only a few strategically designed examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The LIMO model, fine-tuned through simple supervised learning on a minimal dataset, achieves 63.3 percent accuracy on AIME24 and 95.6 percent on MATH500, outperforming previous fine-tuned models that relied on far larger datasets. It also shows strong gains on out-of-distribution benchmarks. These results lead to the Less-Is-More Reasoning Hypothesis: in foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes. The hypothesis identifies two controlling factors: the completeness of the pre-trained knowledge base and the effectiveness of post-training examples as cognitive templates that guide reasoning.
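As a quick sanity check (my arithmetic, not the paper's), the headline percentages are consistent with whole-number problem counts on the standard benchmark sizes (30 problems for AIME24, 500 for MATH500):

```python
# Back-of-the-envelope consistency check on the reported numbers.
# Benchmark sizes: AIME24 has 30 problems, MATH500 has 500.
aime24_total, math500_total = 30, 500

aime24_correct = round(0.633 * aime24_total)    # 19 problems
math500_correct = round(0.956 * math500_total)  # 478 problems

# The rounded counts reproduce the reported accuracies exactly.
print(f"AIME24: {aime24_correct}/{aime24_total} = {aime24_correct / aime24_total:.1%}")
print(f"MATH500: {math500_correct}/{math500_total} = {math500_correct / math500_total:.1%}")
```

So 63.3% and 95.6% correspond to 19/30 and 478/500 problems solved, respectively.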
What carries the argument
The Less-Is-More Reasoning Hypothesis, which states that post-training examples function as cognitive templates to guide reasoning once domain knowledge is already present in the pre-trained model.
If this is right
- Reasoning performance on difficult benchmarks can improve substantially even when training data is reduced by two orders of magnitude.
- Out-of-distribution generalization improves when examples are chosen to demonstrate cognitive processes rather than to cover every possible case.
- The threshold for eliciting complex reasoning depends on the strategic design of the few examples rather than on task difficulty or data scale.
- Models can reach high accuracy on contest-level math problems without requiring datasets that exhaustively cover the domain.
Where Pith is reading between the lines
- The same minimal-template approach may transfer to reasoning tasks outside mathematics, such as scientific problem solving or code generation.
- Future experiments could test whether the same few examples produce comparable gains when applied to models with deliberately reduced pre-training on the target domain.
- Design principles for creating effective cognitive templates could become a central focus for improving reasoning efficiency across different model scales.
Load-bearing premise
The foundation model has already encoded the relevant domain knowledge during pre-training, so the small set of examples needs only to provide templates rather than supply new facts.
What would settle it
Fine-tuning a model that lacks comprehensive pre-training on the same small set of examples and finding that it fails to reach comparable accuracy on AIME24 or MATH500.
read the original abstract
We challenge the prevailing assumption that complex reasoning in large language models (LLMs) necessitates massive training data. We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples. Specifically, through simple supervised fine-tuning, our model, LIMO, achieves 63.3% accuracy on AIME24 and 95.6% on MATH500, surpassing previous fine-tuned models (6.5% on AIME24, 59.2% on MATH500) while using only 1% of the training data required by prior approaches. Furthermore, LIMO exhibits strong out-of-distribution generalization, achieving a 45.8% absolute improvement across diverse benchmarks, outperforming models trained on 100x more data. Synthesizing these findings, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes. This hypothesis suggests that the threshold for eliciting complex reasoning is not dictated by task complexity but rather by two key factors: (1) the completeness of the model's pre-trained knowledge base and (2) the effectiveness of post-training examples in serving as "cognitive templates" that guide reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LIMO, a model obtained via simple supervised fine-tuning on a small curated set of examples. It reports 63.3% accuracy on AIME24 and 95.6% on MATH500, substantially exceeding prior fine-tuned models (6.5% and 59.2% respectively) while using only ~1% of the training data. The work also claims strong out-of-distribution gains and synthesizes these results into the LIMO Hypothesis: once domain knowledge is pre-encoded, sophisticated reasoning emerges from minimal but strategically designed demonstrations of cognitive processes.
Significance. If the central claim is supported by appropriate controls, the result would be significant for the field. It would provide concrete evidence that post-training data volume is not the primary bottleneck for eliciting complex mathematical reasoning in foundation models, shifting emphasis toward example design and cognitive-template quality. The reported absolute gains on hard benchmarks with extreme data reduction would be a notable data-efficiency finding.
major comments (2)
- [Experimental setup and results] The load-bearing claim of the LIMO Hypothesis—that gains arise specifically from 'strategically designed demonstrations of cognitive processes' rather than higher example quality or implicit test-pattern coverage—requires an ablation that holds example count, length, and source pool fixed while varying only selection/curatorial strategy (e.g., random sampling vs. the authors' chosen set). No such control is described in the experimental setup or results sections; without it the hypothesis remains untested and the performance deltas could be explained by data quality alone.
- [Abstract and §4 (Experiments)] The abstract and results report large deltas (63.3% AIME24, 95.6% MATH500) but provide no details on example selection criteria, number of examples, baseline training runs, statistical significance, variance across seeds, or controls for data leakage. These omissions make it impossible to assess robustness of the central claim that the small set functions as effective cognitive templates.
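The control asked for in the first major comment can be made concrete: draw a random subset from the same source pool, matched in size and length to the curated set, so that only the selection strategy varies between conditions. A minimal sketch of the matched-sampling step, assuming each example is a dict with a `solution` string (the pool structure and function name are hypothetical, not from the paper):

```python
import random
import statistics

def matched_random_control(pool, curated, seed=0, tolerance=0.1):
    """Draw a random subset of `pool` with the same size as `curated`
    and a mean solution length within `tolerance` (relative) of it,
    via simple rejection sampling."""
    rng = random.Random(seed)
    target_len = statistics.mean(len(ex["solution"]) for ex in curated)
    for _ in range(10_000):
        sample = rng.sample(pool, len(curated))
        mean_len = statistics.mean(len(ex["solution"]) for ex in sample)
        if abs(mean_len - target_len) / target_len <= tolerance:
            return sample
    raise RuntimeError("no length-matched sample found within budget")
```

Fine-tuning identically on `curated` and on `matched_random_control(pool, curated)` isolates the curatorial strategy: any remaining accuracy gap is then attributable to selection rather than to example count, length, or source quality.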
minor comments (2)
- [Abstract] The exact size of the 'few examples' training set and the precise fraction of prior data (claimed as 1%) should be stated explicitly in the abstract and methods for reproducibility.
- [Introduction / Hypothesis section] Notation for the LIMO Hypothesis could be clarified; the two key factors (completeness of pre-trained knowledge and effectiveness of cognitive templates) are described qualitatively but lack operational definitions or measurable proxies.
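One hypothetical way to give the two qualitative factors measurable proxies (the function names, probe setup, and example count below are illustrative assumptions, not definitions from the paper):

```python
def knowledge_completeness(probe_results):
    """Proxy for factor (1): fraction of closed-book domain probe
    questions the base model answers correctly before fine-tuning."""
    return sum(probe_results) / len(probe_results)

def template_effectiveness(acc_before, acc_after, n_examples):
    """Proxy for factor (2): absolute accuracy gain amortized per
    post-training example (higher means each template does more work)."""
    return (acc_after - acc_before) / n_examples

# Illustrative values: 3 of 4 probes correct; a 6.5% -> 63.3% jump
# from a hypothetical set of 800 curated examples.
print(knowledge_completeness([1, 1, 0, 1]))
print(template_effectiveness(0.065, 0.633, 800))
```

Such proxies would let the two factors be varied and measured independently, which is what an operational statement of the hypothesis requires.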
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas to strengthen the support for our central claims. We address the major comments point by point below and indicate revisions planned for the next manuscript version.
read point-by-point responses
-
Referee: The load-bearing claim of the LIMO Hypothesis—that gains arise specifically from 'strategically designed demonstrations of cognitive processes' rather than higher example quality or implicit test-pattern coverage—requires an ablation that holds example count, length, and source pool fixed while varying only selection/curatorial strategy (e.g., random sampling vs. the authors' chosen set). No such control is described in the experimental setup or results sections; without it the hypothesis remains untested and the performance deltas could be explained by data quality alone.
Authors: We agree that a controlled ablation isolating the curatorial strategy is necessary to test the hypothesis against alternatives such as general data quality. In the revised manuscript we add this experiment to §4: we draw an equal number of examples from the identical source pool, matched on length and distribution, and compare fine-tuning performance against our strategically selected set. The curated examples yield higher accuracy, indicating that the specific demonstrations of cognitive processes contribute beyond random selection from high-quality data. We also expand the description of our original selection criteria. Revision: yes.
-
Referee: The abstract and results report large deltas (63.3% AIME24, 95.6% MATH500) but provide no details on example selection criteria, number of examples, baseline training runs, statistical significance, variance across seeds, or controls for data leakage. These omissions make it impossible to assess robustness of the central claim that the small set functions as effective cognitive templates.
Authors: We acknowledge these reporting gaps and have revised the abstract together with §4 to supply the missing information. The updated text now states the precise number of examples, the explicit selection criteria (prioritizing demonstrations of decomposition, verification, and generalization), results from multiple independent training runs with seed variance, statistical significance testing against baselines, and explicit checks confirming absence of test-set leakage. These additions improve reproducibility while preserving the original performance numbers. Revision: yes.
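The robustness reporting promised here can be sketched with standard statistics. The snippet below computes seed variance and a two-proportion z-test on an accuracy delta; the plugged-in counts are illustrative, treating the MATH500 accuracies (95.6% vs. 59.2%) as 478/500 and 296/500 correct:

```python
import math
import statistics

def seed_variance(accuracies):
    """Mean and sample standard deviation of per-seed accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """z statistic for the difference between two benchmark accuracies,
    using the pooled-proportion standard error."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p = (correct_a + correct_b) / (n_a + n_b)  # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Illustrative: 478/500 vs. 296/500 correct on MATH500.
z = two_proportion_z(478, 500, 296, 500)
print(f"z = {z:.1f}")  # far beyond conventional significance thresholds
```

A delta this large is trivially significant at these sample sizes; the more informative additions are the per-seed variance and the leakage checks, which a single z-test cannot substitute for.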
Circularity Check
No significant circularity detected; hypothesis is interpretive synthesis of reported results
full rationale
The paper reports concrete experimental outcomes (63.3% AIME24, 95.6% MATH500 with ~1% prior data volume) from supervised fine-tuning on a curated small set, then synthesizes the LIMO Hypothesis as an after-the-fact interpretation. No equations, fitted parameters, or self-citations are shown that reduce the central claim to its inputs by construction. The hypothesis is offered as a post-experiment generalization rather than a tautology or renamed fit. While the absence of an ablation holding example count fixed and varying only selection strategy weakens evidential support for the 'cognitive templates' mechanism, this is a limitation in experimental design, not a circular reduction in the derivation chain itself. The reported performance numbers stand as independent observations against which the hypothesis can be evaluated.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: foundation models encode comprehensive domain knowledge during pre-training.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tag: unclear)
Unclear: relation between the paper passage and the cited Recognition theorem.
LIMO Hypothesis: In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes.
-
IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tag: unclear)
Unclear: relation between the paper passage and the cited Recognition theorem.
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 with only 1% of prior training data
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
-
Logic-Regularized Verifier Elicits Reasoning from LLMs
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
-
SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.
-
On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
-
Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning
For a fixed data budget in LLM supervised fine-tuning, optimal data difficulty shifts toward harder examples as the budget grows because of the tradeoff between in-distribution generalization gap and extrapolation gap.
-
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
-
SIAM: Head and Brain MRI Segmentation from Few High-Quality Templates via Synthetic Training
SIAM achieves state-of-the-art whole-head MRI segmentation of 16 structures including extra-cerebral tissues by training on synthetic data from just six manual templates, matching or exceeding prior methods on 301 sca...
-
When Less is Enough: Efficient Inference via Collaborative Reasoning
A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
-
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
Rank-Surprisal Ratio (RSR) correlates strongly (average Spearman 0.86) with post-distillation reasoning gains across five student models and trajectories from eleven teachers, outperforming existing selection metrics.
-
CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.
-
Learning to Reason under Off-Policy Guidance
LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-poli...
-
Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning
Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.
-
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
-
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
- Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models