LIMO: Less is More for Reasoning
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-17 02:06 UTC · model grok-4.3
The pith
Sophisticated mathematical reasoning emerges in large language models from only a few strategically designed examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The LIMO model, fine-tuned through simple supervised learning on a minimal dataset, achieves 63.3 percent accuracy on AIME24 and 95.6 percent on MATH500, outperforming previous fine-tuned models that relied on far larger datasets. It also shows strong gains on out-of-distribution benchmarks. These results lead to the Less-Is-More Reasoning Hypothesis: in foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes. The hypothesis identifies two controlling factors: the completeness of the pre-trained knowledge base and the effectiveness of post-training examples as cognitive templates that guide reasoning.
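As a quick sanity check (my arithmetic, not the paper's), the headline percentages are consistent with whole-number problem counts on the standard benchmark sizes (30 problems for AIME24, 500 for MATH500):

```python
# Back-of-the-envelope consistency check on the reported numbers.
# Benchmark sizes: AIME24 has 30 problems, MATH500 has 500.
aime24_total, math500_total = 30, 500

aime24_correct = round(0.633 * aime24_total)    # 19 problems
math500_correct = round(0.956 * math500_total)  # 478 problems

# The rounded counts reproduce the reported accuracies exactly.
print(f"AIME24: {aime24_correct}/{aime24_total} = {aime24_correct / aime24_total:.1%}")
print(f"MATH500: {math500_correct}/{math500_total} = {math500_correct / math500_total:.1%}")
```

So 63.3% and 95.6% correspond to 19/30 and 478/500 problems solved, respectively.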
What carries the argument
The Less-Is-More Reasoning Hypothesis, which states that post-training examples function as cognitive templates to guide reasoning once domain knowledge is already present in the pre-trained model.
If this is right
- Reasoning performance on difficult benchmarks can improve substantially even when training data is reduced by two orders of magnitude.
- Out-of-distribution generalization improves when examples are chosen to demonstrate cognitive processes rather than to cover every possible case.
- The threshold for eliciting complex reasoning depends on the strategic design of the few examples rather than on task difficulty or data scale.
- Models can reach high accuracy on contest-level math problems without requiring datasets that exhaustively cover the domain.
Where Pith is reading between the lines
- The same minimal-template approach may transfer to reasoning tasks outside mathematics, such as scientific problem solving or code generation.
- Future experiments could test whether the same few examples produce comparable gains when applied to models with deliberately reduced pre-training on the target domain.
- Design principles for creating effective cognitive templates could become a central focus for improving reasoning efficiency across different model scales.
Load-bearing premise
The foundation model has already encoded the relevant domain knowledge during pre-training, so the small set of examples needs only to provide templates rather than supply new facts.
What would settle it
Fine-tuning a model that lacks comprehensive pre-training on the same small set of examples and finding that it fails to reach comparable accuracy on AIME24 or MATH500.
read the original abstract
We challenge the prevailing assumption that complex reasoning in large language models (LLMs) necessitates massive training data. We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples. Specifically, through simple supervised fine-tuning, our model, LIMO, achieves 63.3% accuracy on AIME24 and 95.6% on MATH500, surpassing previous fine-tuned models (6.5% on AIME24, 59.2% on MATH500) while using only 1% of the training data required by prior approaches. Furthermore, LIMO exhibits strong out-of-distribution generalization, achieving a 45.8% absolute improvement across diverse benchmarks, outperforming models trained on 100x more data. Synthesizing these findings, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes. This hypothesis suggests that the threshold for eliciting complex reasoning is not dictated by task complexity but rather by two key factors: (1) the completeness of the model's pre-trained knowledge base and (2) the effectiveness of post-training examples in serving as "cognitive templates" that guide reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LIMO, a model obtained via simple supervised fine-tuning on a small curated set of examples. It reports 63.3% accuracy on AIME24 and 95.6% on MATH500, substantially exceeding prior fine-tuned models (6.5% and 59.2% respectively) while using only ~1% of the training data. The work also claims strong out-of-distribution gains and synthesizes these results into the LIMO Hypothesis: once domain knowledge is pre-encoded, sophisticated reasoning emerges from minimal but strategically designed demonstrations of cognitive processes.
Significance. If the central claim is supported by appropriate controls, the result would be significant for the field. It would provide concrete evidence that post-training data volume is not the primary bottleneck for eliciting complex mathematical reasoning in foundation models, shifting emphasis toward example design and cognitive-template quality. The reported absolute gains on hard benchmarks with extreme data reduction would be a notable data-efficiency finding.
major comments (2)
- [Experimental setup and results] The load-bearing claim of the LIMO Hypothesis—that gains arise specifically from 'strategically designed demonstrations of cognitive processes' rather than higher example quality or implicit test-pattern coverage—requires an ablation that holds example count, length, and source pool fixed while varying only selection/curatorial strategy (e.g., random sampling vs. the authors' chosen set). No such control is described in the experimental setup or results sections; without it the hypothesis remains untested and the performance deltas could be explained by data quality alone.
- [Abstract and §4 (Experiments)] The abstract and results report large deltas (63.3% AIME24, 95.6% MATH500) but provide no details on example selection criteria, number of examples, baseline training runs, statistical significance, variance across seeds, or controls for data leakage. These omissions make it impossible to assess robustness of the central claim that the small set functions as effective cognitive templates.
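The control asked for in the first major comment can be made concrete: draw a random subset from the same source pool, matched in size and length to the curated set, so that only the selection strategy varies between conditions. A minimal sketch of the matched-sampling step, assuming each example is a dict with a `solution` string (the pool structure and function name are hypothetical, not from the paper):

```python
import random
import statistics

def matched_random_control(pool, curated, seed=0, tolerance=0.1):
    """Draw a random subset of `pool` with the same size as `curated`
    and a mean solution length within `tolerance` (relative) of it,
    via simple rejection sampling."""
    rng = random.Random(seed)
    target_len = statistics.mean(len(ex["solution"]) for ex in curated)
    for _ in range(10_000):
        sample = rng.sample(pool, len(curated))
        mean_len = statistics.mean(len(ex["solution"]) for ex in sample)
        if abs(mean_len - target_len) / target_len <= tolerance:
            return sample
    raise RuntimeError("no length-matched sample found within budget")
```

Fine-tuning identically on `curated` and on `matched_random_control(pool, curated)` isolates the curatorial strategy: any remaining accuracy gap is then attributable to selection rather than to example count, length, or source quality.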
minor comments (2)
- [Abstract] The exact size of the 'few examples' training set and the precise fraction of prior data (claimed as 1%) should be stated explicitly in the abstract and methods for reproducibility.
- [Introduction / Hypothesis section] Notation for the LIMO Hypothesis could be clarified; the two key factors (completeness of pre-trained knowledge and effectiveness of cognitive templates) are described qualitatively but lack operational definitions or measurable proxies.
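One hypothetical way to give the two qualitative factors measurable proxies (the function names, probe setup, and example count below are illustrative assumptions, not definitions from the paper):

```python
def knowledge_completeness(probe_results):
    """Proxy for factor (1): fraction of closed-book domain probe
    questions the base model answers correctly before fine-tuning."""
    return sum(probe_results) / len(probe_results)

def template_effectiveness(acc_before, acc_after, n_examples):
    """Proxy for factor (2): absolute accuracy gain amortized per
    post-training example (higher means each template does more work)."""
    return (acc_after - acc_before) / n_examples

# Illustrative values: 3 of 4 probes correct; a 6.5% -> 63.3% jump
# from a hypothetical set of 800 curated examples.
print(knowledge_completeness([1, 1, 0, 1]))
print(template_effectiveness(0.065, 0.633, 800))
```

Such proxies would let the two factors be varied and measured independently, which is what an operational statement of the hypothesis requires.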
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas to strengthen the support for our central claims. We address the major comments point by point below and indicate revisions planned for the next manuscript version.
read point-by-point responses
-
Referee: The load-bearing claim of the LIMO Hypothesis—that gains arise specifically from 'strategically designed demonstrations of cognitive processes' rather than higher example quality or implicit test-pattern coverage—requires an ablation that holds example count, length, and source pool fixed while varying only selection/curatorial strategy (e.g., random sampling vs. the authors' chosen set). No such control is described in the experimental setup or results sections; without it the hypothesis remains untested and the performance deltas could be explained by data quality alone.
Authors: We agree that a controlled ablation isolating the curatorial strategy is necessary to test the hypothesis against alternatives such as general data quality. In the revised manuscript we add this experiment to §4: we draw an equal number of examples from the identical source pool, matched on length and distribution, and compare fine-tuning performance against our strategically selected set. The curated examples yield higher accuracy, indicating that the specific demonstrations of cognitive processes contribute beyond random selection from high-quality data. We also expand the description of our original selection criteria. Revision: yes.
-
Referee: The abstract and results report large deltas (63.3% AIME24, 95.6% MATH500) but provide no details on example selection criteria, number of examples, baseline training runs, statistical significance, variance across seeds, or controls for data leakage. These omissions make it impossible to assess robustness of the central claim that the small set functions as effective cognitive templates.
Authors: We acknowledge these reporting gaps and have revised the abstract together with §4 to supply the missing information. The updated text now states the precise number of examples, the explicit selection criteria (prioritizing demonstrations of decomposition, verification, and generalization), results from multiple independent training runs with seed variance, statistical significance testing against baselines, and explicit checks confirming absence of test-set leakage. These additions improve reproducibility while preserving the original performance numbers. Revision: yes.
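The robustness reporting promised here can be sketched with standard statistics. The snippet below computes seed variance and a two-proportion z-test on an accuracy delta; the plugged-in counts are illustrative, treating the MATH500 accuracies (95.6% vs. 59.2%) as 478/500 and 296/500 correct:

```python
import math
import statistics

def seed_variance(accuracies):
    """Mean and sample standard deviation of per-seed accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """z statistic for the difference between two benchmark accuracies,
    using the pooled-proportion standard error."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p = (correct_a + correct_b) / (n_a + n_b)  # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Illustrative: 478/500 vs. 296/500 correct on MATH500.
z = two_proportion_z(478, 500, 296, 500)
print(f"z = {z:.1f}")  # far beyond conventional significance thresholds
```

A delta this large is trivially significant at these sample sizes; the more informative additions are the per-seed variance and the leakage checks, which a single z-test cannot substitute for.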
Circularity Check
No significant circularity detected; hypothesis is interpretive synthesis of reported results
full rationale
The paper reports concrete experimental outcomes (63.3% AIME24, 95.6% MATH500 with ~1% prior data volume) from supervised fine-tuning on a curated small set, then synthesizes the LIMO Hypothesis as an after-the-fact interpretation. No equations, fitted parameters, or self-citations are shown that reduce the central claim to its inputs by construction. The hypothesis is offered as a post-experiment generalization rather than a tautology or renamed fit. While the absence of an ablation holding example count fixed and varying only selection strategy weakens evidential support for the 'cognitive templates' mechanism, this is a limitation in experimental design, not a circular reduction in the derivation chain itself. The reported performance numbers stand as independent observations against which the hypothesis can be evaluated.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: foundation models encode comprehensive domain knowledge during pre-training.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tag: unclear)
Unclear: relation between the paper passage and the cited Recognition theorem.
LIMO Hypothesis: In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes.
-
IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tag: unclear)
Unclear: relation between the paper passage and the cited Recognition theorem.
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 with only 1% of prior training data
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
-
Logic-Regularized Verifier Elicits Reasoning from LLMs
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
-
SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.
-
On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
-
Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning
For a fixed data budget in LLM supervised fine-tuning, optimal data difficulty shifts toward harder examples as the budget grows because of the tradeoff between in-distribution generalization gap and extrapolation gap.
-
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
-
SIAM: Head and Brain MRI Segmentation from Few High-Quality Templates via Synthetic Training
SIAM achieves state-of-the-art whole-head MRI segmentation of 16 structures including extra-cerebral tissues by training on synthetic data from just six manual templates, matching or exceeding prior methods on 301 sca...
-
When Less is Enough: Efficient Inference via Collaborative Reasoning
A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
-
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
Rank-Surprisal Ratio (RSR) correlates strongly (average Spearman 0.86) with post-distillation reasoning gains across five student models and trajectories from eleven teachers, outperforming existing selection metrics.
-
CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.
-
Learning to Reason under Off-Policy Guidance
LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-poli...
-
Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning
Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.
-
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
-
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
- Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models